[slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

Mon Oct 30 12:50:01 UTC 2023

I'm fighting this strange scenario where slurmd is started before the 
Infiniband/OPA network is fully up.  The Node Health Check (NHC) executed 
by slurmd then fails the node (as it should).  This happens only on EL8 
Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with 
Infiniband/OPA network work without problems.

Question: Does anyone know how to reliably delay the start of the slurmd 
systemd service until the Infiniband/OPA network is fully up?

Note: Our Infiniband/OPA network fabric is Omni-Path 100 Gbit/s, not 
Mellanox IB.  On AlmaLinux 8.8 we use the in-distro OPA drivers since the 
CornelisNetworks drivers are not available for RHEL 8.8.

The details:

The slurmd service is started by the service file 
/usr/lib/systemd/system/slurmd.service after the "network-online.target" 
has been reached.

It seems that NetworkManager reports "network-online.target" BEFORE the 
Infiniband/OPA device ib0 is actually up, and this seems to be the cause 
of our problems!

Here is the relevant sequence of events from the syslog, showing that 
the network target is reached before the Infiniband/OPA link (hfi1_0 
adapter) is up:

Oct 30 13:01:40 d064 systemd[1]: Reached target Network is Online.
(lines deleted)
Oct 30 13:01:41 d064 slurmd[2333]: slurmd: error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  check_hw_ib:  No IB port is ACTIVE (LinkUp 100 Gb/sec).
(lines deleted)
Oct 30 13:01:41 d064 kernel: hfi1 0000:4b:00.0: hfi1_0: 8051: Link up
Oct 30 13:01:41 d064 kernel: hfi1 0000:4b:00.0: hfi1_0: set_link_state: current GOING_UP, new INIT (LINKUP)
Oct 30 13:01:41 d064 kernel: hfi1 0000:4b:00.0: hfi1_0: physical state changed to PHYS_LINKUP (0x5), phy 0x50

I tried to delay the NetworkManager "network-online.target" by setting a 
wait on the ib0 device and rebooting, but the setting seems to be ignored:

$ nmcli -p connection modify "System ib0" connection.wait-device-timeout 20

I'm hoping that other sites using Omni-Path have seen this and maybe can 
share a fix or workaround?

Of course we could remove the Infiniband check in Node Health Check (NHC), 
but that would not really be acceptable during operations.
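
A less drastic workaround might be to make slurmd itself wait for the 
link.  The following is only a minimal sketch and has not been tested on 
Omni-Path nodes: it polls the port state file that the kernel exposes 
under /sys/class/infiniband until it reports ACTIVE.  The function name, 
the choice of port 1 on hfi1_0 and the 60-second default timeout are 
assumptions made for illustration.

```shell
#!/bin/bash
# Sketch: block until an Infiniband/OPA port reports ACTIVE, or give up.
# Assumptions (not from the original post): the function name, the default
# sysfs path for port 1 of hfi1_0, and the 60-second default timeout.

wait_for_ib_active() {
    # $1: sysfs state file; for an active port it reads like "4: ACTIVE"
    local state_file="${1:-/sys/class/infiniband/hfi1_0/ports/1/state}"
    # $2: maximum number of seconds to wait
    local timeout="${2:-60}"
    local waited=0

    while true; do
        # The file may not exist yet if the driver has not created the device
        if [ -r "$state_file" ] && grep -q ACTIVE "$state_file"; then
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            echo "Timed out after ${timeout}s waiting for $state_file" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

Saved as a script that ends by calling wait_for_ib_active, this could be 
run from an ExecStartPre= line in a systemd drop-in for slurmd.service, 
so that slurmd (and hence NHC) only starts once the port is ACTIVE.  A 
non-zero exit would make the unit fail to start rather than bring up a 
node that NHC would immediately mark as failed.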

Thanks for sharing any insights,
Ole

-- 
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark