[slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Mon Nov 13 13:27:54 UTC 2023


Hi Max and Ward,

I've made a variation of your scripts which waits for at least one InfiniBand 
port to come up before starting services such as slurmd or NFS mounts.

I prefer Max's systemd service, which is ordered before the systemd 
network-online.target.  I also like Ward's script, which checks the 
InfiniBand status in /sys/class/infiniband/ instead of relying on 
NetworkManager being installed.
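For illustration, here is a minimal Python sketch (not Ward's actual script) of 
checking port status through sysfs rather than NetworkManager. It assumes the 
standard Linux RDMA layout, where /sys/class/infiniband/<device>/ports/<n>/state 
contains text such as "4: ACTIVE"; the function names are hypothetical.

```python
import glob
import os

# Assumption: standard Linux RDMA sysfs layout, where each port has a
# "state" file whose content looks like "4: ACTIVE" or "1: DOWN".
SYSFS_IB_ROOT = "/sys/class/infiniband"


def parse_port_state(text):
    """Return the state name from a sysfs string, e.g. '4: ACTIVE' -> 'ACTIVE'."""
    head, sep, tail = text.partition(":")
    # Keep the human-readable part after the numeric code, if present.
    return (tail if sep else head).strip().upper()


def active_ports(root=SYSFS_IB_ROOT):
    """Return a sorted list of (device, port) pairs whose state is ACTIVE."""
    result = []
    pattern = os.path.join(glob.escape(root), "*", "ports", "*", "state")
    for state_file in sorted(glob.glob(pattern)):
        port_dir = os.path.dirname(state_file)          # .../<dev>/ports/<n>
        port = os.path.basename(port_dir)
        device = os.path.basename(os.path.dirname(os.path.dirname(port_dir)))
        try:
            with open(state_file) as f:
                state = parse_port_state(f.read())
        except OSError:
            continue  # the device may disappear while we scan
        if state == "ACTIVE":
            result.append((device, port))
    return result
```

On a node with one active Mellanox port, active_ports() would return something 
like [("mlx5_0", "1")]; an empty list means no port is up yet.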

At our site we have compute nodes with several different types of 
NICs:

1. Mellanox InfiniBand.
2. Cornelis Omni-Path, which behaves just like InfiniBand.
3. Intel X722 Ethernet NICs presenting a "fake" iRDMA InfiniBand interface.
4. Plain Ethernet only.

I've written some modified scripts, which are available at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/InfiniBand
and which have been tested on all 4 types of NICs listed above.

Case 3 is particularly troublesome, as reported earlier, because it's 
an Ethernet port which presents an iRDMA InfiniBand interface.  My 
waitforib.sh script therefore skips NICs whose link_layer type is not 
InfiniBand.
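To make the link_layer rule concrete, here is a minimal Python sketch of the 
same idea (the real waitforib.sh is a shell script; this is not its code). It 
assumes each sysfs port directory holds a "link_layer" file ("InfiniBand" or 
"Ethernet") next to "state". The function names and the timeout/interval 
defaults are hypothetical.

```python
import glob
import os
import time

# Assumption: standard Linux RDMA sysfs layout with per-port "state"
# ("4: ACTIVE") and "link_layer" ("InfiniBand" or "Ethernet") files.
SYSFS_IB_ROOT = "/sys/class/infiniband"


def _read(path):
    with open(path) as f:
        return f.read().strip()


def infiniband_port_active(root=SYSFS_IB_ROOT):
    """True if at least one port whose link_layer is InfiniBand is ACTIVE.

    Ports with any other link_layer (e.g. an Ethernet NIC exposing an
    iRDMA device) are skipped entirely.
    """
    for port_dir in glob.glob(os.path.join(glob.escape(root), "*", "ports", "*")):
        try:
            if _read(os.path.join(port_dir, "link_layer")) != "InfiniBand":
                continue
            state = _read(os.path.join(port_dir, "state"))
        except OSError:
            continue  # missing file or vanished device: ignore this port
        if state.split(":")[-1].strip() == "ACTIVE":
            return True
    return False


def wait_for_infiniband(timeout=180.0, interval=2.0, root=SYSFS_IB_ROOT):
    """Poll until an InfiniBand port is ACTIVE; False if `timeout` expires."""
    deadline = time.monotonic() + timeout
    while True:
        if infiniband_port_active(root):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The timeout matters because a unit ordered before network-online.target would 
otherwise block boot indefinitely on a node whose IB link never comes up. A 
caller could turn the result into an exit status, e.g. 
sys.exit(0 if wait_for_infiniband() else 1), for a systemd oneshot unit; how 
such a unit is wired up is an assumption here.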

Comments and suggestions would be most welcome.

Best regards,
Ole

On 11/10/23 19:45, Ward Poelmans wrote:
> Hi Ole,
> 
> On 10/11/2023 15:04, Ole Holm Nielsen wrote:
>> On 11/5/23 21:32, Ward Poelmans wrote:
>>> Yes, it's very similar. I've put our systemd unit file also online on 
>>> https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11
>>
>> This might disturb the logic in waitforib.sh, or at least cause some 
>> confusion?
> 
> I had never heard of these cards. But if they behave like InfiniBand 
> cards, is there also an .../ports/1/state file present in /sys with the 
> state? In that case it should work just as well.
> 
> We could also change the glob '/sys/class/infiniband/*/ports/*/state' to 
> only look at devices starting with mlx. I have no clue how much diversity 
> is out there; we only have Mellanox cards (or rebrands of those).
> 
>> IMHO, this seems quite confusing.
> 
> Yes, I agree.
> 
>> Regarding the slurmd service:
> 
>> An alternative to this extra service would be like Max's service file 
>> https://github.com/maxlxl/network.target_wait-for-interfaces/blob/main/wait-for-interfaces.service which has:
>> Before=network-online.target
>>
>> What do you think of these considerations?
> 
> I think Max's approach is the better one. We only do it for slurmd, while 
> his is completely general for everything that waits on the network. The 
> downside is probably that if you have issues with your IB network, this 
> will make it worse ;)
> 
> Ward
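Ward's suggestion of restricting the glob to devices whose names start with 
mlx could look like this as a Python sketch, assuming the standard sysfs layout 
(the function name is hypothetical). Note the trade-off: a prefix such as "mlx" 
would also exclude the Omni-Path adapters in case 2 (the hfi1 driver names its 
devices hfi1_0, hfi1_1, ...), which is one argument for filtering on link_layer 
instead.

```python
import glob
import os

SYSFS_IB_ROOT = "/sys/class/infiniband"  # assumed standard RDMA sysfs location


def state_files(root=SYSFS_IB_ROOT, device_prefix=""):
    """List port state files, optionally only for devices whose name starts
    with `device_prefix` (e.g. "mlx" for mlx4/mlx5 devices).

    An empty prefix reproduces the original '*/ports/*/state' glob.
    """
    pattern = os.path.join(
        glob.escape(root), glob.escape(device_prefix) + "*", "ports", "*", "state"
    )
    return sorted(glob.glob(pattern))
```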