[slurm-users] A strange situation of different network cards on the same network
James Lam
unison2004 at gmail.com
Wed Oct 11 02:29:44 UTC 2023
We have a cluster of 176 nodes with an InfiniBand switch plus a 10GbE
network, and we use the 10GbE network for SSH and management traffic.
Some nodes still have the older Marvell 10GbE cards that shipped at launch:
https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_117b0672d7ef4c5bb0eca02886
and the current batch of nodes has QLogic 10GbE cards:
https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_9bd8f647238c4a5f8c72a5221b&tab=revisionHistory
We are running Slurm 20.11.4 on the server, and the node health check
daemon is also deployed following the OpenHPC method. The nodes with the
Marvell 10GbE cards give us no trouble - they never flap between the
down and idle states in Slurm. The nodes with the QLogic cards, however,
do flip-flop between the down and idle states.
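
In case it helps anyone compare notes, the flapping can be observed with
the standard Slurm tools; the node name below is just a placeholder, not
one of our hostnames:

    sinfo -R                     # list down/drained nodes with the recorded reason
    scontrol show node cn001     # full state and reason string for one node
    scontrol ping                # from the affected node, check contact with slurmctld
    systemctl status slurmd      # confirm slurmd itself is still running on the node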
We tried increasing the ARP cache size and upgrading the clients to the
20.11.9 point release, but neither helped.
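
By increasing the ARP cache we mean raising the usual Linux
neighbour-table sysctls; the values below are only illustrative, not
necessarily what we set:

    # /etc/sysctl.d/90-arp.conf -- illustrative values only
    net.ipv4.neigh.default.gc_thresh1 = 4096
    net.ipv4.neigh.default.gc_thresh2 = 8192
    net.ipv4.neigh.default.gc_thresh3 = 16384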
Has anyone faced a similar situation?