[slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

Ward Poelmans ward.poelmans at vub.be
Fri Nov 10 18:45:19 UTC 2023


Hi Ole,

On 10/11/2023 15:04, Ole Holm Nielsen wrote:
> On 11/5/23 21:32, Ward Poelmans wrote:
>> Yes, it's very similar. I've put our systemd unit file also online on https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11
> 
> This might disturb the logic in waitforib.sh, or at least cause some confusion?

I had never heard of these cards. But if they behave like InfiniBand cards, is there also a .../ports/1/state file under /sys that reports the port state? If so, it should work just as well.

We could also change the glob '/sys/class/infiniband/*/ports/*/state' to only look at devices whose names start with mlx. I have no clue how much diversity is out there; we only have Mellanox cards (or rebrands of those).
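
To make the idea concrete, here is a minimal sketch of such a filter. It is written in Python for illustration and is not the code from waitforib.sh or from our unit file. The function names (port_states, all_active, wait_for_ib), the timeout values, and the "mlx" prefix are all assumptions of mine; it also assumes that a port that is up reports a line ending in "ACTIVE" (e.g. "4: ACTIVE") in its state file.

```python
import glob
import os
import time

# Glob from the discussion above; one state file per device port.
STATE_GLOB = "/sys/class/infiniband/*/ports/*/state"


def port_states(pattern=STATE_GLOB, device_prefix=None):
    """Return {state_file_path: state_text} for every matching port.

    device_prefix (hypothetical parameter) restricts the result to
    devices whose name starts with it, e.g. "mlx".
    """
    states = {}
    for path in sorted(glob.glob(pattern)):
        # Path layout: .../<device>/ports/<port>/state
        device = path.split(os.sep)[-4]
        if device_prefix and not device.startswith(device_prefix):
            continue
        with open(path) as fh:
            states[path] = fh.read().strip()
    return states


def all_active(states):
    """True if at least one port was found and every port is ACTIVE.

    Assumption: an up port reads like "4: ACTIVE". A node with no
    matching ports counts as "not ready" in this sketch.
    """
    return bool(states) and all(s.endswith("ACTIVE") for s in states.values())


def wait_for_ib(timeout=300, interval=5, **kwargs):
    """Poll until all selected ports are ACTIVE or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if all_active(port_states(**kwargs)):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling wait_for_ib(device_prefix="mlx") would then ignore any non-Mellanox device under /sys/class/infiniband. Note that with this choice a node without any matching card waits for the full timeout, so the prefix would have to be adapted per site.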
  
> IMHO, this seems quite confusing.

Yes, I agree.
  
> Regarding the slurmd service:
  
> An alternative to this extra service would be like Max's service file https://github.com/maxlxl/network.target_wait-for-interfaces/blob/main/wait-for-interfaces.service which has:
> Before=network-online.target
> 
> What do you think of these considerations?

I think Max's approach is the better one. We only do it for slurmd, while his is completely general for everything that waits on the network. The downside is probably that if you have an issue with your IB network, this will make it worse ;)

Ward