[slurm-users] Compute nodes cycling from idle to down on a regular basis ?

Jeremy Fix Jeremy.Fix at centralesupelec.fr
Wed Feb 2 18:56:48 UTC 2022


Hello, thank you for your suggestion, and thanks also to Tina.

To answer your question, there is no TreeWidth entry in the slurm.conf
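
(For reference, if we do end up adding it, my understanding of the
suggestion is something along these lines; the value below is only an
illustration for our ~148 nodes, not something prescribed:

    # /etc/slurm/slurm.conf on the controller
    TreeWidth=160

followed by a restart of the controller, e.g. "systemctl restart slurmctld".)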

But it seems we have figured out the issue, and I'm sorry we did not 
think of it earlier: we already had a pool of 48 nodes on this master, 
but their slurm.conf had diverged from the one on the pool of nodes 
with the cycling state; at the very least, their slurmd was not restarted.

And indeed several people suggested that the slurmd daemons need to talk 
to each other. That's really on us: the 100 new nodes were aware of all 
148 nodes, while the 48 older nodes were only aware of themselves. I 
suppose that confused the controller.
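
In case it helps anyone hitting the same thing, a quick way to spot this
kind of divergence, assuming clush (or pdsh) is available and substituting
your own hostnames for the placeholder ranges below:

    # group nodes by the checksum of their slurm.conf; a second group means a diverged file
    clush -bw node[001-148] md5sum /etc/slurm/slurm.conf

    # once the file is identical everywhere, restart slurmd on the stale pool
    clush -w node[001-048] systemctl restart slurmd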

So even though we also had other issues, such as network interfaces 
flip-flopping, the diverged slurm.conf was most likely the root cause.

Thank you all for your help. It is time to compute :)

Jeremy.


On 02/02/2022 16:27, Stephen Cousins wrote:
> Hi Jeremy,
>
> What is the value of TreeWidth in your slurm.conf? If there is no 
> entry then I recommend setting it to a value a bit larger than the 
> number of nodes you have in your cluster and then restarting slurmctld.
>
> Best,
>
> Steve
>
> On Wed, Feb 2, 2022 at 12:59 AM Jeremy Fix 
> <Jeremy.Fix at centralesupelec.fr> wrote:
>
>     Hi,
>
>     A follow-up. I thought some of the nodes were OK, but that's not the
>     case. This morning, another pool of consecutive compute nodes (why
>     consecutive, by the way? they always fail consecutively) went to
>     idle*. And some of the nodes which were drained came back to life as
>     idle and have now switched back to idle*.
>
>     One thing I should mention is that this master is now handling a
>     total of 148 nodes; it is the new pool of 100 nodes which have the
>     cycling state. The 48 nodes that were already handled by this master
>     are OK.
>
>     I do not know if this should be considered a large system, but we
>     tried to have a look at settings such as the ARP cache [1] on the
>     Slurm master. I'm not very familiar with that; it seems to me it
>     enlarges the kernel's table of node IP/MAC entries. This morning the
>     master has 125 lines in "arp -a" (before changing the settings in
>     sysctl it was more like 20). Do you think this setting is also
>     necessary on the compute nodes?
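>
>     (For context, the ARP-cache tuning from [1] boils down to sysctl
>     settings along these lines; the thresholds here are only the
>     illustrative values we tried, to be adjusted for your own network size:
>
>         # /etc/sysctl.d/99-arp-cache.conf
>         net.ipv4.neigh.default.gc_thresh1 = 8192
>         net.ipv4.neigh.default.gc_thresh2 = 32768
>         net.ipv4.neigh.default.gc_thresh3 = 65536
>
>     applied with "sysctl --system".)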
>
>     Best;
>
>     Jeremy.
>
>
>     [1]
>     https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
>
>
>
>
>
>
> -- 
> ________________________________________________________________
>  Steve Cousins Supercomputer Engineer/Administrator
>  Advanced Computing Group           University of Maine System
>  244 Neville Hall (UMS Data Center)              (207) 581-3574
>  Orono ME 04469                      steve.cousins at maine.edu