[slurm-users] Drain node from TaskProlog / TaskEpilog
Mark Dixon
mark.c.dixon at durham.ac.uk
Tue May 25 12:09:17 UTC 2021
Thanks to everyone for their help, much appreciated.
It seems to confirm that things would be much easier if I could just
figure out a way to detect the issue from the Prolog/Epilog rather than
the TaskProlog/TaskEpilog!
All the best,
Mark
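
For anyone landing on this thread later: a non-zero exit from Prolog or
Epilog does drain the node, so if a check can be made to work in that
context, the script itself is trivial. A minimal sketch, assuming a
hypothetical check_gpus helper that reproduces the failure outside the
job's cgroup (which is exactly the part that is hard here):

    #!/usr/bin/env python3
    # Hypothetical Epilog script: Slurm drains the node when Epilog exits
    # non-zero, so this only needs to run the check and pass its status on.
    import subprocess
    import sys

    # '/usr/local/sbin/check_gpus' is a made-up stand-in for whatever GPU
    # test turns out to work outside the job's cgroup.
    result = subprocess.run(["/usr/local/sbin/check_gpus"])
    sys.exit(result.returncode)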
On Mon, 24 May 2021, Brian Andrus wrote:
>
> Ah. I'll proceed under the scenario that there is a piece of hardware
> that is being tested and may lock up (the GPU in this case).
>
> If you are able to identify that the issue is occurring from within the
> job, you should exit the job with an error or some signal to alert Slurm
> (e.g. a semaphore file). You can then use something like EpilogSlurmctld
> to recognize that and reboot the node accordingly.
>
> This presumes the node needs a full reboot, which I am guessing
> affects the entire job. If you are able to do something like
> unloading/reloading the CUDA drivers between tasks, that may be a way to
> continue the job while still 'fixing' the issue. That could be done in
> the TaskEpilog script (assuming your daemon user has permission to do so).
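
For reference, a rough sketch of the semaphore-file plus EpilogSlurmctld
approach described above (Python only to keep the examples in this thread
in one language; the semaphore path is invented, and it assumes the file
lands somewhere the slurmctld host can read, e.g. a shared filesystem):

    #!/usr/bin/env python3
    # Hypothetical EpilogSlurmctld script: if the job left a semaphore file
    # behind, drain (or reboot) the node(s) it ran on.
    import os
    import subprocess

    job_id = os.environ.get("SLURM_JOB_ID", "")
    nodelist = os.environ.get("SLURM_JOB_NODELIST", "")

    # Made-up location on a shared filesystem; the TaskEpilog (or the job
    # itself) would have to create this when the GPU check fails.
    semaphore = f"/shared/slurm-flags/gpu-bad.{job_id}"

    if nodelist and os.path.exists(semaphore):
        subprocess.run([
            "scontrol", "update",
            f"nodename={nodelist}",
            "state=drain",
            f"reason=GPU check failed in job {job_id}",
        ])
        # Or, if a reboot is what actually clears the fault:
        # subprocess.run(["scontrol", "reboot", "ASAP", nodelist])
        os.remove(semaphore)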
>
> On 5/24/2021 8:56 AM, Mark Dixon wrote:
>> Hi Brian,
>>
>> Thanks for replying. On our hardware, GPUs allocated to a job by
>> cgroup sometimes get themselves into a state requiring a reboot.
>>
>> Outside the job, a simple CUDA program calling the API function
>> cudaGetDeviceCount works happily. Inside the job, it returns an error
>> code of 3 (cudaErrorInitializationError).
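
For reference, the probe itself does not have to be a compiled CUDA
program; a ctypes call against the CUDA runtime library exercises the same
API. A rough sketch, assuming libcudart.so (or a versioned variant) is
resolvable on the node:

    #!/usr/bin/env python3
    # Minimal probe of the CUDA runtime from inside the job environment:
    # cudaGetDeviceCount(&count) returns 0 (cudaSuccess) on a healthy node
    # and 3 (cudaErrorInitializationError) in the failure mode described
    # above.
    import ctypes
    import sys

    # Assumes libcudart.so can be found via the usual loader paths.
    cudart = ctypes.CDLL("libcudart.so")

    count = ctypes.c_int(0)
    err = cudart.cudaGetDeviceCount(ctypes.byref(count))

    if err != 0:
        print(f"cudaGetDeviceCount failed with error {err}", file=sys.stderr)
        sys.exit(1)

    print(f"found {count.value} CUDA device(s)")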
>>
>> At present, I have a TaskProlog that prods this API function and
>> emails me when there is a failure. It'd be nice if the nodes could
>> drain themselves without administrator intervention, rather than
>> continuing to run waiting jobs and so causing them to fail.
>>
>> I can see a couple of ways to do it (e.g. a sudo script in TaskProlog,
>> or playing with the cgroup hierarchy outside of Slurm), but I was
>> wondering if I had misunderstood the Slurm docs and there was a
>> simpler way.
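
For what it's worth, the "sudo script in TaskProlog" route does not need
many moving parts. A rough sketch, with every path and helper name below
invented for illustration; the drain wrapper would be a root-owned script
that runs scontrol update with state=drain for the local node, and the
sudoers rule has to cover the job users, since TaskProlog runs as the job
owner:

    #!/usr/bin/env python3
    # Hypothetical TaskProlog: run the GPU probe and, if it fails, drain
    # the node through a narrowly scoped sudo rule.
    #
    # Example sudoers entry (broad on purpose, since TaskProlog runs as
    # the job's user, not as root or the slurm user):
    #   ALL ALL=(root) NOPASSWD: /usr/local/sbin/drain_this_node
    import socket
    import subprocess
    import sys

    # 'gpu_probe' stands in for the cudaGetDeviceCount check above.
    probe = subprocess.run(["/usr/local/sbin/gpu_probe"])

    if probe.returncode != 0:
        subprocess.run(["sudo", "-n", "/usr/local/sbin/drain_this_node"])
        # Report the failure to the task's output as well: Slurm interprets
        # TaskProlog stdout lines starting with "print " as text to write
        # to the task's standard output.
        print(f"print GPU probe failed on {socket.gethostname()}, draining node")
        sys.exit(1)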
>>
>> Best,
>>
>> Mark
>>
>> On Mon, 24 May 2021, Brian Andrus wrote:
>>
>>> I'm not sure I understand how a failed node can only be detected from
>>> inside the job environment.
>>>
>>> That description sounds more like "our application is behaving badly,
>>> but not so badly that the node stops responding." For that situation,
>>> your app or job should have something in place to catch that and report
>>> it to Slurm in some fashion (up to and including killing the process).
>>>
>>> Slurm polls the nodes and if slurmd does not respond, it will mark the
>>> node as failed. So slurmd must be responding.
>>>
>>> If you can provide a better description of what symptoms you see that
>>> cause you to feel the node has failed, we can help a little more.
>>>
>>> On 5/24/2021 3:02 AM, Mark Dixon wrote:
>>>> Hi all,
>>>>
>>>> Sometimes our compute nodes get into a failed state which we can only
>>>> detect from inside the job environment.
>>>>
>>>> I can see that TaskProlog / TaskEpilog allows us to run our detection
>>>> test; however, unlike Epilog and Prolog, they do not drain a node if
>>>> they exit with a non-zero exit code.
>>>>
>>>> Does anyone have advice on automatically draining a node in this
>>>> situation, please?
>>>>
>>>> Best wishes,
>>>>
>>>> Mark