[slurm-users] Drain node from TaskProlog / TaskEpilog
mark.c.dixon at durham.ac.uk
Mon May 24 15:56:38 UTC 2021
Thanks for replying. On our hardware, GPUs allocated to a job by cgroup
sometimes get themselves into a state requiring a reboot.
Outside the job, a simple CUDA program calling the API function
cudaGetDeviceCount works happily. Inside the job, it returns an error code
of 3 (cudaErrorInitializationError).
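For illustration, a check of this kind can be approximated from Python
with ctypes. This is only a minimal sketch, not the actual test program:
it assumes the CUDA runtime can be loaded as "libcudart.so", and the
names probe_device_count, load_cudart and main are hypothetical.

```python
import ctypes
import sys

CUDA_SUCCESS = 0  # cudaSuccess


def probe_device_count(cudart):
    """Call cudaGetDeviceCount and return (error_code, device_count)."""
    count = ctypes.c_int(0)
    err = cudart.cudaGetDeviceCount(ctypes.byref(count))
    return int(err), count.value


def load_cudart(name="libcudart.so"):
    """Load the CUDA runtime; the library name is an assumption."""
    return ctypes.CDLL(name)


def main():
    """Return 0 if the probe succeeds, 1 if CUDA reports an error,
    2 if the runtime library cannot be loaded at all."""
    try:
        cudart = load_cudart()
    except OSError as exc:
        print(f"cannot load CUDA runtime: {exc}", file=sys.stderr)
        return 2
    err, count = probe_device_count(cudart)
    if err != CUDA_SUCCESS:
        print(f"cudaGetDeviceCount failed with error {err}", file=sys.stderr)
        return 1
    print(f"{count} CUDA device(s) visible", file=sys.stderr)
    return 0
```

A wrapper script could end with sys.exit(main()) so that the exit status
tells the caller whether the GPUs visible to the job are usable.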
At present, I have a TaskProlog that prods this API function and emails me
when there is a failure. It'd be nice if the nodes could drain themselves
without administrator intervention, rather than continuing to start queued
jobs that then fail.
I can see a couple of ways to do it (e.g. sudo script in TaskProlog, or
playing with the cgroup hierarchy outside of slurm), but was wondering if
I had misunderstood the slurm docs and there was a simpler way.
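The "sudo script in TaskProlog" option could look roughly like the
sketch below. It is written under several assumptions: a sudoers rule
lets job users run this specific scontrol command without a password
(TaskProlog runs as the job's user, who normally cannot change node
state), SLURMD_NODENAME is present in the task environment (falling
back to the short hostname otherwise), and the helper names are
hypothetical. The only Slurm command used is
"scontrol update NodeName=... State=DRAIN Reason=...".

```python
import os
import socket
import subprocess


def local_node_name(environ=os.environ):
    """Node name as Slurm knows it (assumed to be in SLURMD_NODENAME)."""
    return environ.get("SLURMD_NODENAME") or socket.gethostname().split(".")[0]


def build_drain_command(node, reason):
    """Argument list that asks slurmctld to drain the given node."""
    return [
        "sudo", "-n",  # -n: fail rather than prompt for a password
        "scontrol", "update",
        f"NodeName={node}",
        "State=DRAIN",
        f"Reason={reason}",
    ]


def drain_if_failed(probe_rc, run=subprocess.run, environ=os.environ):
    """Drain the local node when the probe exit status is non-zero.

    Returns True if a drain request was issued, False otherwise.
    """
    if probe_rc == 0:
        return False
    reason = f"GPU probe failed rc={probe_rc}"
    run(build_drain_command(local_node_name(environ), reason), check=False)
    return True
```

Draining only stops new jobs from starting on the node; the job whose
TaskProlog detected the problem would still need to be dealt with
separately.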
On Mon, 24 May 2021, Brian Andrus wrote:
> Not sure I understand how a failed node can only be detected from
> inside the job environment.
> That description sounds more like "our application is behaving badly,
> but not so badly that the node quits responding." For that situation,
> your app or job should have something that catches this and reports it
> to slurm in some fashion (up to and including killing the process).
> Slurm polls the nodes and if slurmd does not respond, it will mark the
> node as failed. So slurmd must be responding.
> If you can provide a better description of what symptoms you see that
> cause you to feel the node has failed, we can help a little more.
> On 5/24/2021 3:02 AM, Mark Dixon wrote:
>> Hi all,
>> Sometimes our compute nodes get into a failed state which we can only
>> detect from inside the job environment.
>> I can see that TaskProlog / TaskEpilog allows us to run our detection
>> test; however, unlike Epilog and Prolog, they do not drain a node if
>> they exit with a non-zero exit code.
>> Does anyone have advice on automatically draining a node in this
>> situation, please?
>> Best wishes,