[slurm-users] Drain node from TaskProlog / TaskEpilog

Mon May 24 15:56:38 UTC 2021

Hi Brian,

Thanks for replying. On our hardware, GPUs allocated to a job by cgroup 
sometimes get themselves into a state requiring a reboot.

Outside the job, a simple CUDA program calling the API function 
cudaGetDeviceCount works happily. Inside the job, it returns an error code 
of 3 (cudaErrorInitializationError).
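
For illustration only, the same probe could be sketched in Python with ctypes rather than a compiled CUDA program. The library name libcudart.so and the helper names device_count_status and load_runtime are assumptions made for this sketch, not part of the test I actually run:

```python
import ctypes

# Error numbers as defined by the CUDA runtime API.
CUDA_SUCCESS = 0
CUDA_ERROR_INITIALIZATION = 3  # cudaErrorInitializationError


def load_runtime(name="libcudart.so"):
    """Load the CUDA runtime library (the file name is an assumption)."""
    return ctypes.CDLL(name)


def device_count_status(runtime):
    """Call cudaGetDeviceCount and return (error_code, device_count)."""
    count = ctypes.c_int(0)
    # cudaGetDeviceCount(int *count) returns a cudaError_t, i.e. an int.
    status = runtime.cudaGetDeviceCount(ctypes.byref(count))
    return int(status), count.value
```

Calling device_count_status(load_runtime()) inside a healthy job should give an error code of 0; on an affected node it would be expected to give 3.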

At present, I have a TaskProlog that calls this API function and emails me 
when it fails. It would be nice if the nodes could drain themselves 
without administrator intervention, rather than continuing to start queued 
jobs that then fail as well.

I can see a couple of ways to do it (e.g. sudo script in TaskProlog, or 
playing with the cgroup hierarchy outside of slurm), but was wondering if 
I had misunderstood the slurm docs and there was a simpler way.
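
A rough sketch of the sudo option, in Python, might look like the following. It assumes that SLURMD_NODENAME is set in the TaskProlog environment and that a sudoers rule lets job users run this one scontrol command without a password; both would need checking on a real cluster. The function names are hypothetical:

```python
import os
import subprocess


def build_drain_command(node, reason, use_sudo=True):
    """Build the scontrol command line that drains one node."""
    cmd = [
        "scontrol", "update",
        f"NodeName={node}",
        "State=DRAIN",
        f"Reason={reason}",  # scontrol requires a reason when draining
    ]
    if use_sudo:
        # -n makes sudo fail instead of prompting for a password.
        cmd = ["sudo", "-n"] + cmd
    return cmd


def drain_this_node(reason, run=subprocess.run):
    """Drain the node named in SLURMD_NODENAME; return True on success."""
    node = os.environ.get("SLURMD_NODENAME")  # assumed to be set by slurmd
    if not node:
        return False
    result = run(build_drain_command(node, reason), check=False)
    return result.returncode == 0
```

The TaskProlog would call drain_this_node("GPU init failed") only when the CUDA check fails. Draining leaves the current job running but stops new jobs from being scheduled on the node.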

Best,

Mark

On Mon, 24 May 2021, Brian Andrus wrote:

> I'm not sure I understand how a failed node can only be detected from
> inside the job environment.
>
> That description sounds more like "our application is behaving badly,
> but not so badly that the node stops responding." In that situation,
> your app or job should do something to catch the problem and report it
> to slurm in some fashion (up to and including killing the process).
>
> Slurm polls the nodes and if slurmd does not respond, it will mark the
> node as failed. So slurmd must be responding.
>
> If you can provide a better description of what symptoms you see that
> cause you to feel the node has failed, we can help a little more.
>
> On 5/24/2021 3:02 AM, Mark Dixon wrote:
>>  Hi all,
>>
>>  Sometimes our compute nodes get into a failed state which we can only
>>  detect from inside the job environment.
>>
>>  I can see that TaskProlog / TaskEpilog allows us to run our detection
>>  test; however, unlike Epilog and Prolog, they do not drain a node if
>>  they exit with a non-zero exit code.
>>
>>  Does anyone have advice on automatically draining a node in this
>>  situation, please?
>>
>>  Best wishes,
>>
>>  Mark
>> 