[slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

Wed Nov 1 09:16:38 UTC 2023

Hi Rémi,

Thanks for the feedback!  The commit message of the revert[1] explains
SchedMD's reasoning:

> The reasoning is that sysadmins who see nodes with Reason "Not Responding"
> but they can manually ping/access the node end up confused. That reason
> should only be set if the node is truly not responding, but not if the
> HealthCheckProgram execution failed or returned non-zero exit code. For
> that case, the program itself would take the appropriate actions, such
> as draining the node and setting an appropriate Reason.

We suspect that slurmd, when it starts up at boot time, may begin
accepting new jobs while NHC is still running in a separate thread, so
that NHC possibly fails the node AFTER a job has already started!  NHC
might fail, for example, if the Infiniband/OPA network or the NVIDIA GPUs
have not yet come up completely.

I still need to verify whether this observation is correct and 
reproducible.  Does anyone have evidence that jobs start before NHC is 
complete when slurmd starts up?

IMHO, slurmd ought to start up without delay at boot time, then execute 
the NHC and wait for it to complete.  Only after NHC has succeeded without 
errors should slurmd begin accepting new jobs.
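
To illustrate the "wait for NHC to succeed" idea outside of slurmd, here
is a minimal Python sketch.  It is NOT how slurmd behaves today; it only
shows the gating logic: re-run a health check until it exits 0 or a
deadline passes.  The function name, the timeout and the interval are
made up for illustration.

```python
import subprocess
import time


def wait_until_healthy(check_cmd, timeout=300.0, interval=10.0):
    """Re-run check_cmd until it exits 0 or `timeout` seconds have passed.

    Returns True if the check eventually passed, False otherwise.
    check_cmd is an argument list such as ["/usr/sbin/nhc"]; that path is
    an assumption and must be adapted to the local installation.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if subprocess.run(check_cmd, check=False).returncode == 0:
                return True
        except OSError:
            # The check program could not be started at all; count this
            # as a failed attempt and retry until the deadline.
            pass
        if time.monotonic() + interval > deadline:
            return False
        time.sleep(interval)
```

A wrapper calling wait_until_healthy(["/usr/sbin/nhc"]) and exiting
non-zero on False could, for example, be run before slurmd is started,
provided the service start timeout is long enough.  Whether that is
preferable to a fix inside slurmd is an open question.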

NHC should then be configured to perform site-specific hardware and
network checks, for example of the Infiniband/OPA network or NVIDIA GPUs.
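
As an example of such a fabric check, the following Python sketch polls
the port state files that the kernel exposes under /sys/class/infiniband
and waits until enough ports report ACTIVE.  I assume here that OPA
(hfi1) ports show up in the same sysfs tree as Infiniband ports; the
function names and default timings are hypothetical.  Note that an
ACTIVE link does not prove that IPoIB addresses or shared filesystems
are ready.

```python
import glob
import os
import time


def active_port_count(sysfs_root="/sys/class/infiniband"):
    """Return (active, total) counted over all ports below sysfs_root."""
    active = total = 0
    pattern = os.path.join(sysfs_root, "*", "ports", "*", "state")
    for path in glob.glob(pattern):
        total += 1
        try:
            with open(path) as fh:
                # The state file contains text such as "4: ACTIVE".
                state = fh.read().split(":", 1)[-1].strip()
        except OSError:
            continue  # port disappeared or is unreadable; not active
        if state == "ACTIVE":
            active += 1
    return active, total


def wait_for_fabric(min_active=1, timeout=120.0, interval=2.0,
                    sysfs_root="/sys/class/infiniband"):
    """Poll until at least min_active ports are ACTIVE or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        active, _total = active_port_count(sysfs_root)
        if active >= min_active:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The sysfs_root parameter exists only so that the functions can be tried
against a fake directory tree on a machine without fabric hardware.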

Best regards,
Ole

On 11/1/23 09:44, Rémi Palancher wrote:
> Hi Ole,
> 
> On 30/10/2023 at 13:50, Ole Holm Nielsen wrote:
>> I'm fighting this strange scenario where slurmd is started before the
>> Infiniband/OPA network is fully up.  The Node Health Check (NHC) executed
>> by slurmd then fails the node (as it should).  This happens only on EL8
>> Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with
>> Infiniband/OPA network work without problems.
>>
>> Question: Does anyone know how to reliably delay the start of the slurmd
>> Systemd service until the Infiniband/OPA network is fully up?
>>
>> …
> 
> FWIW, after struggling for a while with systemd dependencies to wait for
> the availability of networks and shared filesystems, one of our customers
> ended up writing a Slurm patch that delays slurmd registration (and job
> starts) until NHC is OK:
> 
> https://github.com/scibian/slurm-wlm/blob/scibian/buster/debian/patches/b31fa177c1ca26dcd2d5cd952e692ef87d95b528
> 
> For the record, this patch was once merged in Slurm and then reverted[1]
> for reasons I did not fully explore.
> 
> This approach is far from your original idea. It is clearly not ideal
> and should be used with caution, but it has worked for years for this
> customer.
> 
> [1]
> https://github.com/SchedMD/slurm/commit/b31fa177c1ca26dcd2d5cd952e692ef87d95b528
> 

-- 
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: Ole.H.Nielsen at fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620