[slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Fri Nov 10 14:04:39 UTC 2023


Hi Ward,

On 11/5/23 21:32, Ward Poelmans wrote:
> Yes, it's very similar. I've put our systemd unit file also online on 
> https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11

This looks really good!  However, I was testing the waitforib.sh script on 
a SuperMicro server WITHOUT Infiniband and only a dual-port Ethernet NIC 
(Intel Corporation Ethernet Connection X722 for 10GBASE-T).

The EL8 drivers in kernel 4.18.0-477.27.2.el8_8.x86_64 seem to think that 
the Ethernet ports are also Infiniband ports:

# ls -l /sys/class/infiniband
total 0
lrwxrwxrwx 1 root root 0 Nov 10 14:31 irdma0 -> ../../devices/pci0000:5d/0000:5d:02.0/0000:5e:00.0/0000:5f:03.0/0000:60:00.0/infiniband/irdma0
lrwxrwxrwx 1 root root 0 Nov 10 14:31 irdma1 -> ../../devices/pci0000:5d/0000:5d:02.0/0000:5e:00.0/0000:5f:03.0/0000:60:00.1/infiniband/irdma1

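This might disturb the logic in waitforib.sh, since a check of 
/sys/class/infiniband/*/ports/*/state could accept an ACTIVE irdma port 
as an Infiniband port.  At the least it causes some confusion.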

One advantage of Max's script using NetworkManager is that nmcli isn't 
fooled by the fake irdma Infiniband device:

# nmcli connection show
NAME  UUID                                  TYPE      DEVICE
eno1  cb0937f8-1902-48f7-8139-37cf0c4077b2  ethernet  eno1
eno2  98130354-9215-412e-ab26-032c76c2dbe4  ethernet  --

I found a discussion of the mysterious irdma device in
https://github.com/prometheus/node_exporter/issues/2769
with this explanation:

>> The irdma module is Intel's replacement for the legacy i40iw module, which was the iWARP driver for the Intel X722. The irdma module is a complete rewrite, which landed in mainline kernel 5.14, and which also now supports the Intel E810 (iWARP & RoCE).

The Infiniband commands also work on the fake device, claiming that it 
runs 100 Gbit/s:

# ibstatus
Infiniband device 'irdma0' port 1 status:
	default gid:	 3cec:ef38:d960:0000:0000:0000:0000:0000
	base lid:	 0x1
	sm lid:		 0x0
	state:		 4: ACTIVE
	phys state:	 5: LinkUp
	rate:		 100 Gb/sec (4X EDR)
	link_layer:	 Ethernet

Infiniband device 'irdma1' port 1 status:
	default gid:	 3cec:ef38:d961:0000:0000:0000:0000:0000
	base lid:	 0x1
	sm lid:		 0x0
	state:		 1: DOWN
	phys state:	 3: Disabled
	rate:		 100 Gb/sec (4X EDR)
	link_layer:	 Ethernet

IMHO, this seems quite confusing.
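
The ibstatus output above does show one distinguishing field: link_layer 
is Ethernet for the irdma ports.  A /sys-only check could read the 
corresponding ports/*/link_layer file and skip such devices.  A minimal 
sketch, which is not part of waitforib.sh: the function names are 
hypothetical, and it assumes that native ports report the string 
"InfiniBand" in that file.

```shell
# Sketch only: list the state files of ports whose link layer is InfiniBand.
# The optional argument overrides the sysfs root, which is handy for testing.
ib_state_files() {
    ib_root="${1:-/sys/class/infiniband}"
    for ib_port in "$ib_root"/*/ports/*; do
        [ -r "$ib_port/link_layer" ] && [ -r "$ib_port/state" ] || continue
        read -r ib_layer < "$ib_port/link_layer"
        # irdma (iWARP/RoCE) ports report "Ethernet" here and are skipped
        if [ "$ib_layer" = "InfiniBand" ]; then
            printf '%s\n' "$ib_port/state"
        fi
    done
}

# Return 0 if at least one InfiniBand-link-layer port reports ACTIVE.
# Relies on word splitting, which is safe for sysfs paths (no whitespace).
ib_any_active() {
    for ib_state in $(ib_state_files "$1"); do
        grep -q ACTIVE "$ib_state" && return 0
    done
    return 1
}
```

On the SuperMicro server above, ib_state_files should print nothing, so 
a wait script could exit early with "No infiniband found" instead of 
accepting irdma0.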

Regarding the slurmd service:

> And we add it as a dependency for slurmd:
> 
> $ cat /etc/systemd/system/slurmd.service.d/wait.conf
> 
> [Service]
> Environment="CUDA_DEVICE_ORDER=PCI_BUS_ID"
> LimitMEMLOCK=infinity
> 
> [Unit]
> After=waitforib.service
> Requires=munge.service
> Wants=waitforib.service

An alternative to adding this extra dependency to slurmd would be to 
order the wait service as in Max's service file 
https://github.com/maxlxl/network.target_wait-for-interfaces/blob/main/wait-for-interfaces.service 
which has:
Before=network-online.target
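
As a rough sketch of what that ordering could look like for 
waitforib.sh, the snippet below writes a oneshot unit.  This is not 
Max's or Ward's actual unit file: the function name, the unit contents 
and the script path /usr/local/sbin/waitforib.sh are assumptions made 
for illustration.

```shell
# Sketch only: write a hypothetical waitforib.service that must finish
# before network-online.target is reached.  The optional argument is the
# target directory (defaults to the usual location for local units).
write_waitforib_unit() {
    unit_dir="${1:-/etc/systemd/system}"
    cat > "$unit_dir/waitforib.service" <<'EOF'
[Unit]
Description=Wait for an Infiniband port to become ACTIVE
Before=network-online.target

[Service]
Type=oneshot
# Hypothetical install path for the wait script
ExecStart=/usr/local/sbin/waitforib.sh
RemainAfterExit=yes

[Install]
WantedBy=network-online.target
EOF
}
```

After "systemctl daemon-reload" and "systemctl enable waitforib.service", 
this only delays slurmd if slurmd.service is itself ordered 
After=network-online.target, which is worth verifying with 
"systemctl cat slurmd".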

What do you think of these considerations?

Best regards,
Ole

> On 2/11/2023 09:28, Ole Holm Nielsen wrote:
>> Hi Ward,
>>
>> Thanks a lot for the feedback!  The method of probing 
>> /sys/class/infiniband/*/ports/*/state is also used in the NHC script 
>> lbnl_hw.nhc and has the advantage of not depending on the nmcli command 
>> from the NetworkManager package.
>>
>> Can I ask you how you implement your script as a service in the Systemd 
>> booting process, perhaps similar to Max's solution in 
>> https://github.com/maxlxl/network.target_wait-for-interfaces ?
>>
>> Thanks,
>> Ole
>>
>> On 11/1/23 20:09, Ward Poelmans wrote:
>>> We have a slightly different script to do the same. It only relies on 
>>> /sys:
>>>
>>> # Search for infiniband devices and wait until
>>> # at least one reports that it is ACTIVE
>>>
>>> if [[ ! -d /sys/class/infiniband ]]
>>> then
>>>      logger "No infiniband found"
>>>      exit 0
>>> fi
>>>
>>> ports=$(ls /sys/class/infiniband/*/ports/*/state)
>>>
>>> for (( count = 0; count < 300; count++ ))
>>> do
>>>      for port in ${ports}; do
>>>          if grep -q ACTIVE "$port"; then
>>>              logger "Infiniband online at $port"
>>>              exit 0
>>>          fi
>>>      done
>>>      sleep 1
>>> done
>>>
>>> logger "Failed to find an active infiniband interface"
>>> exit 1
> 



