[slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
Ward Poelmans
ward.poelmans at vub.be
Sun Nov 5 20:32:37 UTC 2023
Hi Ole,
Yes, it's very similar. I've also put our systemd unit file online at https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11
And we add it as a dependency for slurmd:
$ cat /etc/systemd/system/slurmd.service.d/wait.conf
[Service]
Environment="CUDA_DEVICE_ORDER=PCI_BUS_ID"
LimitMEMLOCK=infinity
[Unit]
After=waitforib.service
Requires=munge.service
Wants=waitforib.service
So far this has worked flawlessly.
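
For readers who cannot open the gist, a oneshot unit along the following lines would be enough to satisfy the After=/Wants= lines in the drop-in above. This is only a sketch and not the contents of the gist: the helper name write_waitforib_unit, the script path /usr/local/sbin/waitforib.sh and the timeout value are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch only: writes a hypothetical waitforib.service unit file.
# Assumptions (not from the original post): the function name, the
# script path /usr/local/sbin/waitforib.sh and the 330 s start timeout.

write_waitforib_unit() {
    # Target directory; defaults to the usual location for local units.
    local unit_dir=${1:-/etc/systemd/system}

    cat > "${unit_dir}/waitforib.service" <<'EOF'
[Unit]
Description=Wait for an InfiniBand port to become ACTIVE

[Service]
# oneshot: systemd treats the unit as started only once the script exits,
# so units ordered After= it are held back until then.
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/sbin/waitforib.sh
# Slightly longer than the 300 s polling loop in the script.
TimeoutStartSec=330
EOF
}
```

Calling write_waitforib_unit with no argument (as root) writes to /etc/systemd/system; a "systemctl daemon-reload" afterwards makes systemd pick up the new unit. Because the drop-in uses Wants= rather than Requires=, slurmd would still be started if the wait script fails.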
Ward
On 2/11/2023 09:28, Ole Holm Nielsen wrote:
> Hi Ward,
>
> Thanks a lot for the feedback! The method of probing /sys/class/infiniband/*/ports/*/state is also used in the NHC script lbnl_hw.nhc and has the advantage of not depending on the nmcli command from the NetworkManager package.
>
> Can I ask how you run your script as a service in the systemd boot process, perhaps similar to Max's solution in https://github.com/maxlxl/network.target_wait-for-interfaces ?
>
> Thanks,
> Ole
>
> On 11/1/23 20:09, Ward Poelmans wrote:
>> We have a slightly different script to do the same. It only relies on /sys:
>>
>> # Search for InfiniBand devices and wait until
>> # at least one port reports that it is ACTIVE
>>
>> if [[ ! -d /sys/class/infiniband ]]
>> then
>>     logger "No infiniband found"
>>     exit 0
>> fi
>>
>> ports=$(ls /sys/class/infiniband/*/ports/*/state)
>>
>> for (( count = 0; count < 300; count++ ))
>> do
>>     for port in ${ports}; do
>>         if grep -q ACTIVE "$port"; then
>>             logger "Infiniband online at $port"
>>             exit 0
>>         fi
>>     done
>>     sleep 1
>> done
>>
>> logger "Failed to find an active infiniband interface"
>> exit 1
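
The polling loop quoted above could also be wrapped in a function so that it can be tried against a fake directory tree before it is put into the boot path. The following is a minimal sketch under stated assumptions: the function name wait_for_ib_port and its two arguments (sysfs root and timeout in seconds) are hypothetical and not part of the original script. Unlike the original, it expands the glob again on every pass, so ports that appear only after the script has started would still be noticed.

```shell
#!/bin/bash
# Sketch: wait until any InfiniBand port under a sysfs root reports ACTIVE.
# wait_for_ib_port and its arguments are hypothetical names for illustration.

wait_for_ib_port() {
    local root=${1:-/sys/class/infiniband}   # directory holding the HCA entries
    local timeout=${2:-300}                  # number of one-second polls
    local count port

    # No InfiniBand directory at all: nothing to wait for.
    [[ -d ${root} ]] || return 0

    for (( count = 0; count < timeout; count++ )); do
        # Re-expand the glob on every pass so late-appearing ports are seen.
        for port in "${root}"/*/ports/*/state; do
            # Skip the literal pattern when the glob matched nothing.
            [[ -r ${port} ]] || continue
            if grep -q ACTIVE "${port}"; then
                echo "${port}"
                return 0
            fi
        done
        sleep 1
    done

    # Timed out without finding an ACTIVE port.
    return 1
}
```

In a boot script this might be used as: wait_for_ib_port && logger "Infiniband online" || { logger "Failed to find an active infiniband interface"; exit 1; }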