[slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Fri Nov 10 14:04:39 UTC 2023

Hi Ward,

On 11/5/23 21:32, Ward Poelmans wrote:
> Yes, it's very similar. I've put our systemd unit file also online on 
> https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11

This looks really good!  However, I was testing the waitforib.sh script on 
a SuperMicro server WITHOUT Infiniband and only a dual-port Ethernet NIC 
(Intel Corporation Ethernet Connection X722 for 10GBASE-T).

The EL8 drivers in kernel 4.18.0-477.27.2.el8_8.x86_64 seem to think that 
the Ethernet ports are also Infiniband ports:

# ls -l /sys/class/infiniband
total 0
lrwxrwxrwx 1 root root 0 Nov 10 14:31 irdma0 -> 
lrwxrwxrwx 1 root root 0 Nov 10 14:31 irdma1 -> 
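To see what waitforib.sh would actually pick up on a node like this, the per-port sysfs attributes can be listed directly. The sketch below prints each RDMA port's state together with its link_layer attribute, which is the same information ibstatus shows as "link_layer". The optional directory argument and the function name list_rdma_ports are hypothetical additions so the listing can also be tried on a copy of the tree. On the X722 server above I would expect both irdma ports to appear with link layer Ethernet.

```shell
#!/bin/bash
# Sketch: list every RDMA port exposed under /sys/class/infiniband with its
# state and link layer. The optional first argument (a hypothetical
# convenience) points the listing at a different directory tree.

list_rdma_ports() {
    local ib_sysfs="$1" port_dir dev port state link_layer
    shopt -s nullglob                          # no ports -> loop body never runs
    for port_dir in "$ib_sysfs"/*/ports/*; do
        port=${port_dir##*/}                   # e.g. "1"
        dev=${port_dir%/ports/*}               # strip "/ports/<n>"
        dev=${dev##*/}                         # e.g. "irdma0"
        state=$(<"$port_dir/state")            # e.g. "4: ACTIVE"
        link_layer=$(<"$port_dir/link_layer")  # "InfiniBand" or "Ethernet"
        printf '%s port %s: %s (%s)\n' "$dev" "$port" "$state" "$link_layer"
    done
}

list_rdma_ports "${1:-/sys/class/infiniband}"
```

Any port that shows "(Ethernet)" here is an iWARP or RoCE device and says nothing about the InfiniBand fabric.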

This might disturb the logic in waitforib.sh, or at least cause some 

One advantage of Max's script using NetworkManager is that nmcli isn't 
fooled by the fake irdma Infiniband device:

# nmcli connection show
NAME  UUID                                  TYPE      DEVICE
eno1  cb0937f8-1902-48f7-8139-37cf0c4077b2  ethernet  eno1
eno2  98130354-9215-412e-ab26-032c76c2dbe4  ethernet  --

I found a discussion of the mysterious irdma device in
with this explanation:

>> The irdma module is Intel's replacement for the legacy i40iw module, which was the iWARP driver for the Intel X722. The irdma module is a complete rewrite, which landed in mainline kernel 5.14, and which also now supports the Intel E810 (iWARP & RoCE).

The Infiniband commands also work on the fake device and claim that it 
runs at 100 Gbit/s, although link_layer shows Ethernet:

# ibstatus
Infiniband device 'irdma0' port 1 status:
	default gid:	 3cec:ef38:d960:0000:0000:0000:0000:0000
	base lid:	 0x1
	sm lid:		 0x0
	state:		 4: ACTIVE
	phys state:	 5: LinkUp
	rate:		 100 Gb/sec (4X EDR)
	link_layer:	 Ethernet

Infiniband device 'irdma1' port 1 status:
	default gid:	 3cec:ef38:d961:0000:0000:0000:0000:0000
	base lid:	 0x1
	sm lid:		 0x0
	state:		 1: DOWN
	phys state:	 3: Disabled
	rate:		 100 Gb/sec (4X EDR)
	link_layer:	 Ethernet

IMHO, this seems quite confusing.
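One way to keep the /sys approach while avoiding the irdma confusion is to read each port's link_layer attribute and only wait on ports whose link layer is InfiniBand. With the original loop, an ACTIVE irdma (Ethernet) port is accepted as proof that InfiniBand is up, and a node whose irdma ports are all down waits the full 300 seconds. The sketch below is a modified version of Ward's loop under that assumption. IB_SYSFS and IB_TIMEOUT are hypothetical overrides (defaulting to /sys/class/infiniband and 300 seconds), and the log helper falls back to stderr if logger is missing. I assume Omni-Path (hfi1) ports also report InfiniBand in link_layer, but that should be checked on an OPA node.

```shell
#!/bin/bash
# Variant of waitforib.sh (sketch): only ports whose link_layer is
# "InfiniBand" count, so iWARP/RoCE devices such as irdma0 are ignored.
# IB_SYSFS and IB_TIMEOUT are hypothetical overrides.

log() { logger -t waitforib "$*" 2>/dev/null || echo "waitforib: $*" >&2; }

wait_for_ib() {
    local ib_sysfs="$1" timeout="$2" count port_dir
    local -a ib_ports=()
    shopt -s nullglob
    # Collect only the ports with an InfiniBand link layer.
    for port_dir in "$ib_sysfs"/*/ports/*; do
        if [[ $(<"$port_dir/link_layer") == InfiniBand ]]; then
            ib_ports+=("$port_dir")
        fi
    done
    if (( ${#ib_ports[@]} == 0 )); then
        log "No InfiniBand ports found"
        return 0
    fi
    for (( count = 0; count < timeout; count++ )); do
        for port_dir in "${ib_ports[@]}"; do
            if grep -q ACTIVE "$port_dir/state"; then
                log "InfiniBand online at $port_dir"
                return 0
            fi
        done
        sleep 1
    done
    log "Failed to find an active InfiniBand port"
    return 1
}

wait_for_ib "${IB_SYSFS:-/sys/class/infiniband}" "${IB_TIMEOUT:-300}"
```

A node that only has Ethernet RDMA devices is treated like a node without InfiniBand and succeeds at once, which matches the original script's behaviour when /sys/class/infiniband is absent.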

Regarding the slurmd service:

> And we add it as a dependency for slurmd:
> $ cat /etc/systemd/system/slurmd.service.d/wait.conf
> [Service]
> LimitMEMLOCK=infinity
> [Unit]
> After=waitforib.service
> Requires=munge.service
> Wants=waitforib.service
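Ward's actual unit file is in the gist above. As a rough illustration of what the drop-in points to, a minimal waitforib.service could be written as in the sketch below. The unit contents, the script path /usr/local/sbin/waitforib.sh and the function name write_waitforib_unit are all assumptions. Type=oneshot makes systemd wait for the script to exit before starting units ordered After= it, and RemainAfterExit=yes keeps the unit active so that it is not re-run on every slurmd restart.

```shell
#!/bin/bash
# Sketch: write a minimal oneshot waitforib.service. The unit contents and
# the script path are hypothetical; Ward's real unit is in his gist.

write_waitforib_unit() {
    local unit_dir="$1"
    if [[ ! -d $unit_dir ]]; then
        echo "write_waitforib_unit: no such directory: $unit_dir" >&2
        return 1
    fi
    # Quoted 'EOF' so the shell does not expand anything in the unit text.
    cat > "$unit_dir/waitforib.service" <<'EOF'
[Unit]
Description=Wait until an InfiniBand port is ACTIVE

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/sbin/waitforib.sh
EOF
}

# Only act when a target directory is given, e.g. /etc/systemd/system.
if (( $# > 0 )); then
    write_waitforib_unit "$1"
fi
```

After writing the unit, run "systemctl daemon-reload". Because the drop-in uses Wants= rather than Requires=, slurmd is still started (after the wait) even if the script exits 1.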

An alternative to this extra service would be something like Max's service 
file, which has:

What do you think of these considerations?

Best regards,

> On 2/11/2023 09:28, Ole Holm Nielsen wrote:
>> Hi Ward,
>> Thanks a lot for the feedback!  The method of probing 
>> /sys/class/infiniband/*/ports/*/state is also used in the NHC script 
>> lbnl_hw.nhc and has the advantage of not depending on the nmcli command 
>> from the NetworkManager package.
>> Can I ask you how you implement your script as a service in the Systemd 
>> booting process, perhaps similar to Max's solution in 
>> https://github.com/maxlxl/network.target_wait-for-interfaces ?
>> Thanks,
>> Ole
>> On 11/1/23 20:09, Ward Poelmans wrote:
>>> We have a slightly different script to do the same. It only relies on 
>>> /sys:
>>> #!/bin/bash
>>> # Search for infiniband devices and wait until
>>> # at least one port reports that it is ACTIVE
>>> if [[ ! -d /sys/class/infiniband ]]
>>> then
>>>      logger "No infiniband found"
>>>      exit 0
>>> fi
>>> ports=$(ls /sys/class/infiniband/*/ports/*/state)
>>> for (( count = 0; count < 300; count++ ))
>>> do
>>>      for port in ${ports}; do
>>>          if grep -q ACTIVE "$port"; then
>>>              logger "Infiniband online at $port"
>>>              exit 0
>>>          fi
>>>      done
>>>      sleep 1
>>> done
>>> logger "Failed to find an active infiniband interface"
>>> exit 1
