[slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

Rémi Palancher remi at rackslab.io
Wed Nov 1 08:44:29 UTC 2023


Hi Ole,

On 30/10/2023 at 13:50, Ole Holm Nielsen wrote:
> I'm fighting this strange scenario where slurmd is started before the
> Infiniband/OPA network is fully up.  The Node Health Check (NHC) executed
> by slurmd then fails the node (as it should).  This happens only on EL8
> Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with
> Infiniband/OPA network work without problems.
> 
> Question: Does anyone know how to reliably delay the start of the slurmd
> Systemd service until the Infiniband/OPA network is fully up?
> 
>
FWIW, after struggling for a while with systemd dependencies to wait for
networks and shared filesystems to become available, we ended up with a
customer writing a Slurm patch that delays slurmd registration (and job
starts) until NHC passes:

https://github.com/scibian/slurm-wlm/blob/scibian/buster/debian/patches/b31fa177c1ca26dcd2d5cd952e692ef87d95b528

For the record, this patch was once merged into Slurm and then reverted[1]
for reasons I did not fully explore.

This approach is far from your original idea. It is clearly not ideal
and should be used with caution, but it has worked for years for this
customer.
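
For comparison with the systemd-side idea in the original question, here
is a minimal sketch of a wait loop. It is not part of the patch above and
is only an illustration under stated assumptions: the function names
ports_active and wait_for_ports and the timeout values are hypothetical,
and it assumes the fabric driver exposes port state in
/sys/class/infiniband/<device>/ports/<port>/state as lines such as
"4: ACTIVE".

```python
import glob
import os
import time


def ports_active(sysfs_root="/sys/class/infiniband"):
    """Return True if at least one port exists and every port is ACTIVE."""
    pattern = os.path.join(sysfs_root, "*", "ports", "*", "state")
    state_files = glob.glob(pattern)
    if not state_files:
        # No device registered yet: treat the fabric as not ready.
        return False
    for path in state_files:
        with open(path) as fh:
            # Assumed file format: "<number>: <STATE>", e.g. "4: ACTIVE".
            state = fh.read().split(":", 1)[-1].strip()
        if state != "ACTIVE":
            return False
    return True


def wait_for_ports(timeout=120.0, interval=2.0,
                   sysfs_root="/sys/class/infiniband"):
    """Poll until ports_active() is true; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if ports_active(sysfs_root):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A small wrapper that exits non-zero when wait_for_ports() returns False
could, in principle, be hooked into the slurmd unit through an
ExecStartPre= line in a systemd drop-in. Whether that is reliable enough
on a given cluster would need testing on site.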

[1] https://github.com/SchedMD/slurm/commit/b31fa177c1ca26dcd2d5cd952e692ef87d95b528

-- 
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io/



