[slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

Ward Poelmans ward.poelmans at vub.be
Sun Nov 5 20:32:37 UTC 2023


Hi Ole,

Yes, it's very similar. I've also put our systemd unit file online at https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11

And we add it as a dependency for slurmd:

$ cat /etc/systemd/system/slurmd.service.d/wait.conf

[Service]
Environment="CUDA_DEVICE_ORDER=PCI_BUS_ID"
LimitMEMLOCK=infinity

[Unit]
After=waitforib.service
Requires=munge.service
Wants=waitforib.service


So far this has worked flawlessly.


Ward



On 2/11/2023 09:28, Ole Holm Nielsen wrote:
> Hi Ward,
> 
> Thanks a lot for the feedback!  The method of probing /sys/class/infiniband/*/ports/*/state is also used in the NHC script lbnl_hw.nhc and has the advantage of not depending on the nmcli command from the NetworkManager package.
> 
> Can I ask how you run your script as a service in the systemd boot process, perhaps similar to Max's solution in https://github.com/maxlxl/network.target_wait-for-interfaces ?
> 
> Thanks,
> Ole
> 
> On 11/1/23 20:09, Ward Poelmans wrote:
>> We have a slightly different script to do the same. It only relies on /sys:
>>
>> # Search for infiniband devices and wait until
>> # at least one reports that it is ACTIVE
>>
>> if [[ ! -d /sys/class/infiniband ]]
>> then
>>      logger "No infiniband found"
>>      exit 0
>> fi
>>
>> ports=$(ls /sys/class/infiniband/*/ports/*/state)
>>
>> for (( count = 0; count < 300; count++ ))
>> do
>>      for port in ${ports}; do
>>          if grep -q ACTIVE "$port"; then
>>              logger "Infiniband online at $port"
>>              exit 0
>>          fi
>>      done
>>      sleep 1
>> done
>>
>> logger "Failed to find an active infiniband interface"
>> exit 1
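
For anyone adapting the script quoted above, here is a minimal sketch of the same polling loop wrapped in a function. It is not what we run in production. The function name wait_for_ib and its two optional arguments (the sysfs directory and the timeout in seconds) are hypothetical names introduced only for illustration. The differences from the original are that the glob is expanded on every pass instead of once through ls, so ports that appear late are still seen, and that messages go to stdout instead of logger, which keeps it easy to try against a fake directory tree.

```shell
#!/bin/bash
# Sketch only: wait_for_ib and its parameters are hypothetical names,
# not part of the original script.
# Usage: wait_for_ib [SYSFS_DIR] [TIMEOUT_SECONDS]
wait_for_ib() {
    local root="${1:-/sys/class/infiniband}"
    local timeout="${2:-300}"
    local count=0
    local port

    # No InfiniBand/OPA hardware at all: nothing to wait for.
    if [ ! -d "$root" ]; then
        echo "No infiniband found"
        return 0
    fi

    while [ "$count" -lt "$timeout" ]; do
        # Expand the glob on every pass so ports that appear late are seen.
        for port in "$root"/*/ports/*/state; do
            # An unmatched glob stays literal; skip it.
            [ -r "$port" ] || continue
            if grep -q ACTIVE "$port"; then
                echo "Infiniband online at $port"
                return 0
            fi
        done
        sleep 1
        count=$((count + 1))
    done

    echo "Failed to find an active infiniband interface"
    return 1
}
```

A wrapper script would end with something like "wait_for_ib || exit 1" so that the unit fails when no port comes up. Note that with Wants= plus After=, as in the drop-in above, slurmd is still started after a failed wait; Requires= would be needed to block it.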
