[slurm-users] A strange situation of different network cards on the same network
James Lam
unison2004 at gmail.com
Wed Oct 11 02:29:44 UTC 2023
We have a cluster of 176 nodes with an InfiniBand switch plus a 10GbE
network, and we use the 10GbE network for SSH and management traffic.
Some nodes still have the older Marvell 10GbE cards that shipped at launch:
https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_117b0672d7ef4c5bb0eca02886
and the current batch of nodes has QLogic 10GbE cards:
https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_9bd8f647238c4a5f8c72a5221b&tab=revisionHistory
We are running Slurm 20.11.4 on the server, and the node health check
daemon is also deployed following the OpenHPC method. The nodes with the
Marvell 10GbE cards give us no trouble - they never flap between the
down and idle states in Slurm. The nodes with the QLogic cards, however,
do flip-flop between the down and idle states.
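
In case it helps anyone compare notes, the flapping can be observed with
the standard Slurm tools; the node name below is just a placeholder, not
one of our hostnames:

    sinfo -R                     # list down/drained nodes with the recorded reason
    scontrol show node cn001     # full state and reason string for one node
    scontrol ping                # from the affected node, check contact with slurmctld
    systemctl status slurmd      # confirm slurmd itself is still running on the node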
We tried increasing the ARP cache size and upgrading the clients to the
20.11.9 point release, but neither helped.
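
By increasing the ARP cache we mean raising the usual Linux
neighbour-table sysctls; the values below are only illustrative, not
necessarily what we set:

    # /etc/sysctl.d/90-arp.conf -- illustrative values only
    net.ipv4.neigh.default.gc_thresh1 = 4096
    net.ipv4.neigh.default.gc_thresh2 = 8192
    net.ipv4.neigh.default.gc_thresh3 = 16384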
Has anyone faced a similar situation?