[slurm-users] Drain node from TaskProlog / TaskEpilog
Mark Dixon
mark.c.dixon at durham.ac.uk
Tue May 25 12:09:17 UTC 2021
Thanks to everyone for their help, much appreciated.
It seems to confirm that things would be much easier if I could just
figure out a way to detect the issue from the Prolog/Epilog rather than
the TaskProlog/TaskEpilog!
All the best,
Mark
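
For anyone landing on this thread later: a non-zero exit from Prolog or
Epilog does drain the node, so if a check can be made to work in that
context, the script itself is trivial. A minimal sketch, assuming a
hypothetical check_gpus helper that reproduces the failure outside the
job's cgroup (which is exactly the part that is hard here):

    #!/usr/bin/env python3
    # Hypothetical Epilog script: Slurm drains the node when Epilog exits
    # non-zero, so this only needs to run the check and pass its status on.
    import subprocess
    import sys

    # '/usr/local/sbin/check_gpus' is a made-up stand-in for whatever GPU
    # test turns out to work outside the job's cgroup.
    result = subprocess.run(["/usr/local/sbin/check_gpus"])
    sys.exit(result.returncode)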
On Mon, 24 May 2021, Brian Andrus wrote:
>
> Ah. I'll proceed under the scenario that there is a piece of hardware
> that is being tested and may lock up (the GPU in this case).
>
> If you are able to identify that the issue is occurring from within the
> job, you should exit the job with an error or some signal to alert Slurm
> (e.g. a semaphore file). You can then use something like EpilogSlurmctld
> to recognize that and reboot the node accordingly.
>
> This presumes the node needs a full reboot, which I am guessing
> affects the entire job. If you are able to do something like
> unloading/reloading the CUDA drivers between tasks, that may be a way to
> continue the job while still 'fixing' the issue. That could be done in
> the TaskEpilog script (assuming your daemon user has permission to do so).
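
For reference, a rough sketch of the semaphore-file plus EpilogSlurmctld
approach described above (Python only to keep the examples in this thread
in one language; the semaphore path is invented, and it assumes the file
lands somewhere the slurmctld host can read, e.g. a shared filesystem):

    #!/usr/bin/env python3
    # Hypothetical EpilogSlurmctld script: if the job left a semaphore file
    # behind, drain (or reboot) the node(s) it ran on.
    import os
    import subprocess

    job_id = os.environ.get("SLURM_JOB_ID", "")
    nodelist = os.environ.get("SLURM_JOB_NODELIST", "")

    # Made-up location on a shared filesystem; the TaskEpilog (or the job
    # itself) would have to create this when the GPU check fails.
    semaphore = f"/shared/slurm-flags/gpu-bad.{job_id}"

    if nodelist and os.path.exists(semaphore):
        subprocess.run([
            "scontrol", "update",
            f"nodename={nodelist}",
            "state=drain",
            f"reason=GPU check failed in job {job_id}",
        ])
        # Or, if a reboot is what actually clears the fault:
        # subprocess.run(["scontrol", "reboot", "ASAP", nodelist])
        os.remove(semaphore)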
>
> On 5/24/2021 8:56 AM, Mark Dixon wrote:
>> Hi Brian,
>>
>> Thanks for replying. On our hardware, GPUs allocated to a job by
>> cgroup sometimes get themselves into a state requiring a reboot.
>>
>> Outside the job, a simple CUDA program calling the API function
>> cudaGetDeviceCount works happily. Inside the job, it returns an error
>> code of 3 (cudaErrorInitializationError).
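
For reference, the probe itself does not have to be a compiled CUDA
program; a ctypes call against the CUDA runtime library exercises the same
API. A rough sketch, assuming libcudart.so (or a versioned variant) is
resolvable on the node:

    #!/usr/bin/env python3
    # Minimal probe of the CUDA runtime from inside the job environment:
    # cudaGetDeviceCount(&count) returns 0 (cudaSuccess) on a healthy node
    # and 3 (cudaErrorInitializationError) in the failure mode described
    # above.
    import ctypes
    import sys

    # Assumes libcudart.so can be found via the usual loader paths.
    cudart = ctypes.CDLL("libcudart.so")

    count = ctypes.c_int(0)
    err = cudart.cudaGetDeviceCount(ctypes.byref(count))

    if err != 0:
        print(f"cudaGetDeviceCount failed with error {err}", file=sys.stderr)
        sys.exit(1)

    print(f"found {count.value} CUDA device(s)")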
>>
>> At present, I have a TaskProlog that prods this API function and
>> emails me when there is a failure. It'd be nice if the nodes could
>> drain themselves without administrator intervention, rather than
>> continuing to run waiting jobs and so causing them to fail.
>>
>> I can see a couple of ways to do it (e.g. a sudo script in TaskProlog,
>> or playing with the cgroup hierarchy outside of Slurm), but I was
>> wondering if I had misunderstood the Slurm docs and there was a
>> simpler way.
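
For what it's worth, the "sudo script in TaskProlog" route does not need
many moving parts. A rough sketch, with every path and helper name below
invented for illustration; the drain wrapper would be a root-owned script
that runs scontrol update with state=drain for the local node, and the
sudoers rule has to cover the job users, since TaskProlog runs as the job
owner:

    #!/usr/bin/env python3
    # Hypothetical TaskProlog: run the GPU probe and, if it fails, drain
    # the node through a narrowly scoped sudo rule.
    #
    # Example sudoers entry (broad on purpose, since TaskProlog runs as
    # the job's user, not as root or the slurm user):
    #   ALL ALL=(root) NOPASSWD: /usr/local/sbin/drain_this_node
    import socket
    import subprocess
    import sys

    # 'gpu_probe' stands in for the cudaGetDeviceCount check above.
    probe = subprocess.run(["/usr/local/sbin/gpu_probe"])

    if probe.returncode != 0:
        subprocess.run(["sudo", "-n", "/usr/local/sbin/drain_this_node"])
        # Report the failure to the task's output as well: Slurm interprets
        # TaskProlog stdout lines starting with "print " as text to write
        # to the task's standard output.
        print(f"print GPU probe failed on {socket.gethostname()}, draining node")
        sys.exit(1)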
>>
>> Best,
>>
>> Mark
>>
>> On Mon, 24 May 2021, Brian Andrus wrote:
>>
>>> I'm not sure I understand how a failed node can only be detected from
>>> inside the job environment.
>>>
>>> That description sounds more like "our application is behaving badly,
>>> but not so badly that the node stops responding." For that situation,
>>> your app or job should have something in place to catch that and report
>>> it to Slurm in some fashion (up to and including killing the process).
>>>
>>> Slurm polls the nodes and if slurmd does not respond, it will mark the
>>> node as failed. So slurmd must be responding.
>>>
>>> If you can provide a better description of what symptoms you see that
>>> cause you to feel the node has failed, we can help a little more.
>>>
>>> On 5/24/2021 3:02 AM, Mark Dixon wrote:
>>>> Hi all,
>>>>
>>>> Sometimes our compute nodes get into a failed state which we can only
>>>> detect from inside the job environment.
>>>>
>>>> I can see that TaskProlog / TaskEpilog allows us to run our detection
>>>> test; however, unlike Epilog and Prolog, they do not drain a node if
>>>> they exit with a non-zero exit code.
>>>>
>>>> Does anyone have advice on automatically draining a node in this
>>>> situation, please?
>>>>
>>>> Best wishes,
>>>>
>>>> Mark