[slurm-users] A strange situation of different network cards on the same network

Ryan Novosielski novosirj at rutgers.edu
Wed Oct 11 02:45:27 UTC 2023

We have, and have had it come and go with no clear explanation. I'd watch out for MTU and netmask trouble, sysctl limits that might be relevant (the kernel's default network settings are apparently sized for links under 1 Gb, not for anything faster), hot spots on the network, etc.
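A minimal sketch of the kind of sysctl tuning meant above, assuming a Linux kernel and 10GbE links; these are commonly suggested 10GbE starting points, not values validated for this cluster:

```
# /etc/sysctl.d/90-10gbe.conf -- illustrative starting points for 10GbE,
# not validated settings for any particular cluster
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
```

Apply with `sysctl --system` and compare interface MTUs across nodes (`ip -o link`) while you're at it, since a single mismatched MTU on the path can look like random flapping.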

|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark

On Oct 10, 2023, at 22:29, James Lam <unison2004 at gmail.com> wrote:

We have a cluster of 176 nodes with an InfiniBand switch and 10GbE, and we use the 10GbE network for SSH. The nodes from the original deployment have Marvell 10GbE cards; the current batch has QLogic 10GbE cards.

We are running Slurm 20.11.4 on the server, and the node health check daemon is also deployed using the OpenHPC method. The nodes with the Marvell 10GbE cards have no issue - they never bounce between the Slurm node down and idle states. The nodes with the QLogic cards, however, do flip-flop between the down and idle states.
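For down <--> idle flapping, the controller-side timeout and recovery settings are worth checking alongside the network; a hedged slurm.conf fragment (the parameter names are standard Slurm options, the values are illustrative only):

```
# slurm.conf -- illustrative values, not a recommendation for this cluster
SlurmdTimeout=600    # give slurmd longer before the controller marks a node DOWN
ReturnToService=2    # let a DOWN node go back to service once slurmd re-registers
```

A longer SlurmdTimeout papers over brief packet loss between slurmctld and slurmd, which is often the visible symptom when the underlying problem is the NIC or the network path.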

We tried increasing the ARP cache and upgrading the clients to subversion 20.11.9, but neither helped.
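For reference, the ARP cache increase mentioned above is typically done through the neighbor-table garbage-collection thresholds; a sketch, with thresholds sized loosely for a ~176-node subnet (illustrative values, not the poster's actual settings):

```
# /etc/sysctl.d/90-arp.conf -- one common way to enlarge the ARP cache;
# thresholds here are illustrative, not validated for this cluster
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096
```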

Has anyone faced a similar situation?

