[slurm-users] Compute nodes cycling from idle to down on a regular basis ?

Tina Friedrich tina.friedrich at it.ox.ac.uk
Wed Feb 2 10:41:58 UTC 2022


Hi Jeremy,

I haven't got anything very intelligent to contribute towards solving 
your problem.

However, what I can tell you is that we run our production cluster with 
one SLURM master running on a virtual machine, handling just over 300 
nodes. We have never seen the sort of problem you describe other than 
when there was a genuine issue contacting the nodes.

The VM running slurmctld doesn't get any tuning; it's a stock CentOS 8 
server. We don't increase any caching (ARP or otherwise) on the master. 
I just checked, and I don't even think I'm doing anything special about 
process or memory limits for the user the SLURM processes run as.
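
(For what it's worth, if you do pursue the ARP cache tuning from the 
page linked below, my understanding is that it comes down to the 
kernel's neighbour table garbage collection thresholds - something 
like this in /etc/sysctl.d/, with the values purely illustrative and 
to be scaled to your network size:

    net.ipv4.neigh.default.gc_thresh1 = 2048
    net.ipv4.neigh.default.gc_thresh2 = 4096
    net.ipv4.neigh.default.gc_thresh3 = 8192

gc_thresh1 is the floor below which the garbage collector leaves the 
table alone, gc_thresh3 the hard maximum; 'sysctl --system' applies 
them.)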

I have - from time to time - had the controller go unresponsive for a 
moment, but that's usually to do with lots of prologs/epilogs happening 
at the same time, and it does not cause node status to flap like that.

So unless you have indications of very high load or memory pressure on 
your master, I wouldn't suspect the master not coping as the cause of 
this.
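
(That's easy enough to check, by the way - something like

    sdiag | grep -Ei 'thread|queue'
    uptime
    free -h

on the master; 'sdiag' ships with SLURM itself, and its agent queue 
size in particular should stay at or near zero - if it's consistently 
large, slurmctld is struggling to get RPCs out to the nodes.)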

(I don't do host files, I use DNS. But that really shouldn't make a 
difference.)

A lot of people have said name resolution - and yes, that could be it - 
but I'm actually also wondering if you might have a network problem 
somewhere? Ethernet, I mean? Congestion, or corrupted packets? 
Multipathing or path failover or spanning tree going wrong or flapping?
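
(If you want a quick first look at that, the interface counters on a 
couple of the affected nodes and on the master are where I'd start - 
something like

    ip -s link show eth0
    ethtool -S eth0 | grep -iE 'err|drop'

with eth0 replaced by whatever your interface is actually called; 
steadily rising error/drop counters would point at a problem below 
SLURM rather than in it.)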

Tina

On 02/02/2022 05:56, Jeremy Fix wrote:
> Hi,
> 
> A follow-up. I thought some of the nodes were ok, but that's not the 
> case: this morning, another pool of consecutive compute nodes is 
> idle* (why consecutive, by the way? it's always consecutive ranges 
> that fail). And some of the nodes which were drained came back to 
> life as idle, and have now switched back to idle*.
> 
> One thing I should mention is that the master is now handling a total 
> of 148 nodes: that's with the new pool of 100 nodes which have the 
> cycling state. The previous 48 nodes that were already handled by 
> this master are ok.
> 
> I do not know if this should be considered a large system, but we 
> tried to have a look at settings such as the ARP cache [1] on the 
> slurm master. I'm not very familiar with that; as I understand it, it 
> enlarges the kernel's cache of IP-to-MAC address mappings. This 
> morning, the master has 125 entries in "arp -a" (before changing the 
> settings in sysctl it was more like 20). Do you think these settings 
> are also necessary on the compute nodes?
> 
> Best,
> 
> Jeremy.
> 
> 
> [1] 
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks 
> 

-- 
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk


