[slurm-users] Compute nodes cycling from idle to down on a regular basis ?

Wed Feb 2 15:27:02 UTC 2022

Hi Jeremy,

What is the value of TreeWidth in your slurm.conf? If there is no entry
then I recommend setting it to a value a bit larger than the number of
nodes you have in your cluster and then restarting slurmctld.
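
As a rough sketch, assuming the 148 nodes mentioned below, the entry
could look like the following. The value 160 is only an illustration
(just above the node count), not a tuned recommendation. You can check
what the controller is currently using with
"scontrol show config | grep -i TreeWidth".

```
# slurm.conf (excerpt)
# Assumption: 160 is a hypothetical value picked only because it is
# slightly larger than the 148 nodes reported in this thread.
TreeWidth=160
```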

Best,

Steve

On Wed, Feb 2, 2022 at 12:59 AM Jeremy Fix <Jeremy.Fix at centralesupelec.fr>
wrote:

> Hi,
>
> A follow-up. I thought some of the nodes were OK, but that's not the
> case. This morning, another pool of consecutive compute nodes is in
> idle* (why consecutive, by the way? They always fail consecutively).
> And some of the nodes which were drained came back to life as idle and
> have now switched to idle* again.
>
> One thing I should mention is that the master is now handling a total of
> 148 nodes. It is the new pool of 100 nodes that has the cycling state.
> The previous 48 nodes that were already handled by this master are OK.
>
> I do not know if this should be considered a large system, but we tried
> to have a look at settings such as the ARP cache [1] on the Slurm
> master. I'm not very familiar with that; it seems to me it enlarges the
> cache of the node names/IPs table. This morning, the master has 125
> lines in "arp -a" (before changing the settings in sysctl, it was
> around 20). Do you think this setting is also necessary on the
> compute nodes?
>
> Best,
>
> Jeremy.
>
>
> [1]
>
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks

-- 
________________________________________________________________
 Steve Cousins             Supercomputer Engineer/Administrator
 Advanced Computing Group            University of Maine System
 244 Neville Hall (UMS Data Center)              (207) 581-3574
 Orono ME 04469                      steve.cousins at maine.edu