[slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Fri Nov 10 14:04:39 UTC 2023
Hi Ward,
On 11/5/23 21:32, Ward Poelmans wrote:
> Yes, it's very similar. I've put our systemd unit file also online on
> https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11
This looks really good! However, I was testing the waitforib.sh script on
a SuperMicro server WITHOUT Infiniband and only a dual-port Ethernet NIC
(Intel Corporation Ethernet Connection X722 for 10GBASE-T).
The EL8 drivers in kernel 4.18.0-477.27.2.el8_8.x86_64 seem to think that
the Ethernet ports are also Infiniband ports:
# ls -l /sys/class/infiniband
total 0
lrwxrwxrwx 1 root root 0 Nov 10 14:31 irdma0 -> ../../devices/pci0000:5d/0000:5d:02.0/0000:5e:00.0/0000:5f:03.0/0000:60:00.0/infiniband/irdma0
lrwxrwxrwx 1 root root 0 Nov 10 14:31 irdma1 -> ../../devices/pci0000:5d/0000:5d:02.0/0000:5e:00.0/0000:5f:03.0/0000:60:00.1/infiniband/irdma1
This might disturb the logic in waitforib.sh, or at least cause some
confusion?
One advantage of Max's script using NetworkManager is that nmcli isn't
fooled by the fake irdma Infiniband device:
# nmcli connection show
NAME  UUID                                  TYPE      DEVICE
eno1  cb0937f8-1902-48f7-8139-37cf0c4077b2  ethernet  eno1
eno2  98130354-9215-412e-ab26-032c76c2dbe4  ethernet  --
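
To illustrate the NetworkManager approach, here is a minimal sketch (not Max's actual script) that decides from nmcli's terse device listing whether any InfiniBand interface is connected. The helper name ib_connected_devices is made up for this example, and it assumes that InfiniBand interfaces show up with TYPE "infiniband" and STATE "connected" in the output of "nmcli -t -f DEVICE,TYPE,STATE device":

```shell
#!/bin/bash
# Sketch only, not Max's script.
# Reads lines of the form "device:type:state" on stdin, as printed by
#   nmcli -t -f DEVICE,TYPE,STATE device
# Hypothetical helper: prints every connected InfiniBand device and
# returns 0 if at least one was found, 1 otherwise.
ib_connected_devices() {
    local dev type state found=1
    while IFS=: read -r dev type state; do
        # Assumption: NetworkManager reports the type as "infiniband"
        if [[ $type == infiniband && $state == connected ]]; then
            echo "$dev"
            found=0
        fi
    done
    return "$found"
}
```

It would be used as "nmcli -t -f DEVICE,TYPE,STATE device | ib_connected_devices". On the X722 server above nothing would match, because both ports are listed as type ethernet.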
I found a discussion of the mysterious irdma device in
https://github.com/prometheus/node_exporter/issues/2769
with this explanation:
>> The irdma module is Intel's replacement for the legacy i40iw module, which was the iWARP driver for the Intel X722. The irdma module is a complete rewrite, which landed in mainline kernel 5.14, and which also now supports the Intel E810 (iWARP & RoCE).
The Infiniband commands also work on the fake device, claiming that it
runs 100 Gbit/s:
# ibstatus
Infiniband device 'irdma0' port 1 status:
        default gid:     3cec:ef38:d960:0000:0000:0000:0000:0000
        base lid:        0x1
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            100 Gb/sec (4X EDR)
        link_layer:      Ethernet

Infiniband device 'irdma1' port 1 status:
        default gid:     3cec:ef38:d961:0000:0000:0000:0000:0000
        base lid:        0x1
        sm lid:          0x0
        state:           1: DOWN
        phys state:      3: Disabled
        rate:            100 Gb/sec (4X EDR)
        link_layer:      Ethernet
IMHO, this seems quite confusing.
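
Since ibstatus reports link_layer "Ethernet" for the irdma ports, a /sys-only check could perhaps skip them by also reading the link_layer file next to the state file. The following is only a sketch of that idea and not the waitforib.sh from the gist. The function name active_ib_ports and the optional root argument are hypothetical (the argument is there only so the logic can be tried on a fake directory tree), and I have not verified how Omni-Path ports report their link_layer:

```shell
#!/bin/bash
# Sketch only: list ports that are ACTIVE *and* have link_layer InfiniBand,
# so that Ethernet RDMA devices such as irdma0 are ignored.
# Hypothetical helper; returns 0 if at least one such port exists, 1 otherwise.
active_ib_ports() {
    local root=${1:-/sys/class/infiniband}
    local port found=1
    for port in "$root"/*/ports/*; do
        # Skip unmatched globs and ports lacking either file
        [[ -r $port/state && -r $port/link_layer ]] || continue
        # link_layer holds "InfiniBand" or "Ethernet";
        # state holds a string such as "4: ACTIVE" or "1: DOWN"
        if grep -q '^InfiniBand' "$port/link_layer" &&
           grep -q 'ACTIVE' "$port/state"; then
            echo "$port"
            found=0
        fi
    done
    return "$found"
}
```

A wait loop could then call "active_ib_ports" once per second instead of grepping every state file. On the X722 server above it would never succeed, so such a script would still need a way to exit early when no InfiniBand link_layer port exists at all.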
Regarding the slurmd service:
> And we add it as a dependency for slurmd:
>
> $ cat /etc/systemd/system/slurmd.service.d/wait.conf
>
> [Service]
> Environment="CUDA_DEVICE_ORDER=PCI_BUS_ID"
> LimitMEMLOCK=infinity
>
> [Unit]
> After=waitforib.service
> Requires=munge.service
> Wants=waitforib.service
An alternative to this extra service would be something like Max's service file
https://github.com/maxlxl/network.target_wait-for-interfaces/blob/main/wait-for-interfaces.service
which has:
Before=network-online.target
What do you think of these considerations?
Best regards,
Ole
> On 2/11/2023 09:28, Ole Holm Nielsen wrote:
>> Hi Ward,
>>
>> Thanks a lot for the feedback! The method of probing
>> /sys/class/infiniband/*/ports/*/state is also used in the NHC script
>> lbnl_hw.nhc and has the advantage of not depending on the nmcli command
>> from the NetworkManager package.
>>
>> Can I ask you how you implement your script as a service in the Systemd
>> booting process, perhaps similar to Max's solution in
>> https://github.com/maxlxl/network.target_wait-for-interfaces ?
>>
>> Thanks,
>> Ole
>>
>> On 11/1/23 20:09, Ward Poelmans wrote:
>>> We have a slightly different script to do the same. It only relies on
>>> /sys:
>>>
>>> # Search for infiniband devices and wait until
>>> # at least one port reports that it is ACTIVE
>>>
>>> if [[ ! -d /sys/class/infiniband ]]
>>> then
>>>     logger "No infiniband found"
>>>     exit 0
>>> fi
>>>
>>> ports=$(ls /sys/class/infiniband/*/ports/*/state)
>>>
>>> for (( count = 0; count < 300; count++ ))
>>> do
>>>     for port in ${ports}; do
>>>         if grep -q ACTIVE "$port"; then
>>>             logger "Infiniband online at $port"
>>>             exit 0
>>>         fi
>>>     done
>>>     sleep 1
>>> done
>>>
>>> logger "Failed to find an active infiniband interface"
>>> exit 1
>