[slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Thu Nov 2 08:28:28 UTC 2023
Hi Ward,
Thanks a lot for the feedback! The method of probing
/sys/class/infiniband/*/ports/*/state is also used in the NHC script
lbnl_hw.nhc and has the advantage of not depending on the nmcli command
from the NetworkManager package.
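For illustration, here is a minimal sketch of that sysfs probe as a reusable bash function. It assumes each state file holds a single line such as "4: ACTIVE" or "1: DOWN". The function name ib_any_active and its optional root-directory argument are hypothetical, added only so that the check can be pointed at a fake directory tree instead of the real /sys.

```shell
#!/bin/bash
# Hypothetical helper: succeed if any port under the given sysfs root is ACTIVE.
# The root argument defaults to the real sysfs path; it exists only so the
# function can be exercised against a fake directory tree.
ib_any_active() {
    local root="${1:-/sys/class/infiniband}"
    local state
    for state in "$root"/*/ports/*/state; do
        # An unmatched glob stays literal, so skip anything that is not a file.
        [[ -f "$state" ]] || continue
        # The state file holds one line, e.g. "4: ACTIVE" or "1: DOWN".
        if grep -q ACTIVE "$state"; then
            echo "$state"   # report which port was found active
            return 0
        fi
    done
    return 1
}
```

Called without arguments, as in ib_any_active && logger "Infiniband is up", it checks the real /sys/class/infiniband tree once. Any retry loop or timeout would have to be wrapped around it.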
Can I ask how you run your script as a service in the systemd boot
process, perhaps similar to Max's solution in
https://github.com/maxlxl/network.target_wait-for-interfaces ?
Thanks,
Ole
On 11/1/23 20:09, Ward Poelmans wrote:
> We have a slightly different script to do the same. It only relies on /sys:
>
> # Search for InfiniBand devices and wait until
> # at least one port reports that it is ACTIVE
>
> if [[ ! -d /sys/class/infiniband ]]
> then
>     logger "No infiniband found"
>     exit 0
> fi
>
> ports=$(ls /sys/class/infiniband/*/ports/*/state)
>
> # Poll once per second, for up to 300 seconds
> for (( count = 0; count < 300; count++ ))
> do
>     for port in ${ports}; do
>         if grep -q ACTIVE "$port"; then
>             logger "Infiniband online at $port"
>             exit 0
>         fi
>     done
>     sleep 1
> done
>
> logger "Failed to find an active infiniband interface"
> exit 1