[slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

Max Rutkowski max.rutkowski at gfz-potsdam.de
Mon Oct 30 13:30:53 UTC 2023


Hi,

we're not using Omni-Path, but we also had issues with Infiniband taking 
too long to come up and slurmd failing to start because of it.

Our solution was to implement a little wait-for-interface systemd 
service which delays the network.target until the ib interface has come up.

We found that NetworkManager reaches the network-online.target as soon 
as the first interface is connected, not when all interfaces are up.

I've put the solution we use on my GitHub: 
https://github.com/maxlxl/network.target_wait-for-interfaces

You may need to make small adjustments, but it's pretty straightforward 
in general.
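
For illustration only, the general idea can be sketched as a small POSIX 
shell function that polls an interface's operstate in sysfs until it reads 
"up" or a timeout expires. This is a minimal sketch and not the code from 
the repository above; the function name wait_for_iface, its arguments and 
the 60 second default are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: block until a network interface reports operstate "up", or time out.
# wait_for_iface and its arguments are hypothetical, not taken from the repo.
wait_for_iface() {
    iface=$1                         # interface name, e.g. ib0
    timeout=${2:-60}                 # seconds to wait before giving up (assumed default)
    sysfs=${3:-/sys/class/net}       # sysfs root, overridable for testing
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # A missing file (interface not created yet) yields an empty state.
        state=$(cat "$sysfs/$iface/operstate" 2>/dev/null)
        if [ "$state" = "up" ]; then
            return 0                 # link is up
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1                         # timed out, interface never came up
}
```

A oneshot systemd unit ordered before the relevant network target could 
then call something like "wait_for_iface ib0 60" and fail if the link never 
comes up. Depending on the fabric, operstate "up" may not coincide exactly 
with the port state that NHC checks, so the check and the timeout may need 
adjusting.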


Kind regards
Max

On 30.10.23 13:50, Ole Holm Nielsen wrote:
> I'm fighting this strange scenario where slurmd is started before the 
> Infiniband/OPA network is fully up.  The Node Health Check (NHC) 
> executed by slurmd then fails the node (as it should).  This happens 
> only on EL8 Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes 
> with Infiniband/OPA network work without problems.
>
> Question: Does anyone know how to reliably delay the start of the 
> slurmd Systemd service until the Infiniband/OPA network is fully up?
>
> Note: Our Infiniband/OPA network fabric is Omni-Path 100 Gbit/s, not 
> Mellanox IB.  On AlmaLinux 8.8 we use the in-distro OPA drivers since 
> the CornelisNetworks drivers are not available for RHEL 8.8.
>
> The details:
>
> The slurmd service is started by the service file 
> /usr/lib/systemd/system/slurmd.service after the 
> "network-online.target" has been reached.
>
> It seems that NetworkManager reports "network-online.target" BEFORE 
> the Infiniband/OPA device ib0 is actually up, and this seems to be the 
> cause of our problems!
>
> Here are some important sequences of events from the syslog showing 
> that the network goes online before the Infiniband/OPA network (hfi1_0 
> adapter) is up:
>
> Oct 30 13:01:40 d064 systemd[1]: Reached target Network is Online.
> (lines deleted)
> Oct 30 13:01:41 d064 slurmd[2333]: slurmd: error: health_check failed: 
> rc:1 output:ERROR:  nhc:  Health check failed: check_hw_ib:  No IB 
> port is ACTIVE (LinkUp 100 Gb/sec).
> (lines deleted)
> Oct 30 13:01:41 d064 kernel: hfi1 0000:4b:00.0: hfi1_0: 8051: Link up
> Oct 30 13:01:41 d064 kernel: hfi1 0000:4b:00.0: hfi1_0: 
> set_link_state: current GOING_UP, new INIT (LINKUP)
> Oct 30 13:01:41 d064 kernel: hfi1 0000:4b:00.0: hfi1_0: physical state 
> changed to PHYS_LINKUP (0x5), phy 0x50
>
> I tried to delay the NetworkManager "network-online.target" by setting 
> a wait on the ib0 device and reboot, but that seems to be ignored:
>
> $ nmcli -p connection modify "System ib0" connection.wait-device-timeout 20
>
> I'm hoping that other sites using Omni-Path have seen this and maybe 
> can share a fix or workaround?
>
> Of course we could remove the Infiniband check in Node Health Check 
> (NHC), but that would not really be acceptable during operations.
>
> Thanks for sharing any insights,
> Ole
>
-- 
Max Rutkowski
IT-Services und IT-Betrieb
Tel.: +49 (0)331/6264-2341
E-Mail: max.rutkowski at gfz-potsdam.de
___________________________________

Helmholtz-Zentrum Potsdam
*Deutsches GeoForschungsZentrum GFZ*
Stiftung des öff. Rechts Land Brandenburg
Telegrafenberg, 14473 Potsdam