[slurm-users] Compute nodes cycling from idle to down on a regular basis ?
Tina Friedrich
tina.friedrich at it.ox.ac.uk
Wed Feb 2 10:41:58 UTC 2022
Hi Jeremy,
I haven't got anything very intelligent to contribute to solve your problem.
However, what I can tell you is that we run our production cluster with
one SLURM master running on a virtual machine handling just over 300
nodes. We have never seen the sort of problem you have other than when
there was a problem contacting the nodes.
The VM running slurmctld doesn't get any tuning; it's a stock CentOS 8
server. There's no increase to any caching (ARP or otherwise) on the
master. I just checked, and I don't think I'm doing anything special
about process or memory limits for the user the SLURM processes run as.
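(If you want to compare, one quick way to see the limits the running
daemon actually has - assuming pgrep is available and a single
slurmctld process - is:

     # limits of the running slurmctld process
     cat /proc/$(pgrep -o slurmctld)/limits

which shows the open-file, process, and memory limits in one go.)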
I have - from time to time - had the controller go unresponsive for a
moment, but that's usually to do with lots of prologs/epilogs happening
at the same time, and it does not cause node status to flap like that.
So unless you have indications of very high load or memory pressure on
the master, I wouldn't suspect the master not coping as the cause of
this.
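(To rule that out quickly, sdiag on the master gives a view of
scheduler health at a glance; a minimal check, assuming a working
client setup on the master:

     # server thread count and agent queue size from the controller
     sdiag | grep -E 'Server thread count|Agent queue size'
     # general load and memory pressure on the master
     uptime
     free -h

A consistently large agent queue would point at the controller
struggling to reach nodes.)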
(I don't do host files, I use DNS. But that really shouldn't make a
difference.)
A lot of people have said name resolution - and yes, that could be it -
but I'm actually also wondering if you might have a network problem
somewhere? Ethernet, I mean? Congestion, or corrupted packets?
Multipathing or path failover or spanning tree going wrong or flapping?
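(A few quick checks, assuming the relevant interface on the master is
eth0 and node001 stands in for one of your cycling nodes - both just
placeholders:

     # name resolution sanity check for a flapping node
     getent hosts node001
     # interface error/drop counters
     ip -s link show eth0
     # NIC-level counters, if the driver exposes them
     ethtool -S eth0 | grep -iE 'err|drop'

Counters that keep climbing while nodes flap would point at the wire
rather than at SLURM.)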
Tina
On 02/02/2022 05:56, Jeremy Fix wrote:
> Hi,
>
> A follow-up. I thought some of the nodes were OK, but that's not the
> case; this morning, another pool of consecutive (why consecutive, by
> the way? they always fail consecutively) compute nodes are idle*. And
> some of the nodes which were drained came back to life in idle and
> have now switched to idle* again.
>
> One thing I should mention is that the master is now handling a total
> of 148 nodes; that's with the new pool of 100 nodes which have a
> cycling state. The previous 48 nodes that were already handled by this
> master are OK.
>
> I do not know if this should be considered a large system, but we
> tried to have a look at settings such as the ARP cache [1] on the
> slurm master. I'm not very familiar with that; it seems to me it
> enlarges the cache of the node names/IPs table. This morning, the
> master has 125 lines in "arp -a" (before changing the settings in
> sysctl, it was more like 20). Do you think this setting is also
> necessary on the compute nodes?
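>
> (For reference, the settings from [1] are the kernel neighbour-table
> thresholds; a sketch of what we put in /etc/sysctl.d/, with
> illustrative values only:
>
>      # raise the ARP/neighbour table garbage-collection thresholds
>      net.ipv4.neigh.default.gc_thresh1 = 8192
>      net.ipv4.neigh.default.gc_thresh2 = 16384
>      net.ipv4.neigh.default.gc_thresh3 = 32768
>
> loaded with "sysctl --system". "ip neigh | wc -l" is a quick way to
> see how full the table currently is.)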
>
> Best;
>
> Jeremy.
>
>
> [1]
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator
Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk