[slurm-users] nodes going to down* and getting stuck in that state
tim.s.carlson at gmail.com
Fri May 21 02:26:10 UTC 2021
The Slurm controller AND all the compute nodes need to know which nodes are
in the cluster. If you add a node, or a node's IP address changes, you need
to let all the nodes know about it, which, for me, usually means
restarting slurmd on the compute nodes.
I mention this because I get caught by it all the time: I add some
nodes and, for whatever reason, miss restarting slurmd on one of the
compute nodes.
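
To make that concrete, here is a rough Python sketch of how one might spot
the nodes the controller cannot reach and build the restart commands. It is
only an illustration under some assumptions: the input is the text printed by
`sinfo -N -h -o "%N %t"` (one "name state" pair per line, where a trailing
`*` marks a node that is not responding), slurmd runs under systemd, and
passwordless ssh plus sudo are available. The function names and the sample
node names are made up for the example.

```python
def parse_node_states(sinfo_output):
    """Map node name -> compact state from `sinfo -N -h -o "%N %t"` text."""
    states = {}
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, state = parts
        # With -N a node in several partitions is listed once per partition;
        # keep the first occurrence only.
        states.setdefault(name, state)
    return states


def non_responding(states):
    """Return nodes whose state ends in '*' (controller cannot reach slurmd)."""
    return [name for name, state in states.items() if state.endswith("*")]


def restart_commands(nodes):
    """Build (not run) one ssh command per node to restart slurmd.

    Assumes systemd, ssh access, and sudo rights on each node.
    """
    return [["ssh", node, "sudo", "systemctl", "restart", "slurmd"]
            for node in nodes]


# Hypothetical sample of sinfo output, for illustration only.
sample = "gpu-1 idle\ngpu-2 down*\ngpu-3 idle*\ngpu-2 down*\n"
states = parse_node_states(sample)
flagged = non_responding(states)
commands = restart_commands(sorted(states))
```

Since every slurmd needs the updated node list, the restart commands are
built for all nodes, not only the flagged ones; the flagged list is just a
quick way to see which nodes the controller has lost contact with.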
On Wed, May 19, 2021 at 9:17 PM Herc Silverstein <
herc.silverstein at schrodinger.com> wrote:
> We have a cluster (in Google Cloud) which has a few partitions set up to
> auto-scale, but one partition is set up to not autoscale. The desired
> state is for all of the nodes in this non-autoscaled partition
> (SuspendExcParts=gpu-t4-4x-ondemand) to continue running uninterrupted.
> However, we are finding that nodes periodically end up in the down*
> state and that we cannot get them back into a usable state. We are
> using Slurm 19.05.7.
> We have a script that runs periodically, checks the state of the
> nodes, and takes action based on that state. If a node is in a down
> state, it gets terminated, and if the termination succeeds its state
> is set to power_down. After a short 1-second pause, the nodes that
> are in the POWERING_DOWN state and are not drained are set to
> RESUME.
> Sometimes after we start up the node and it's running slurmd we cannot
> get some of these nodes back into a usable slurm state even after
> manually fiddling with its state. It seems to go between idle* and
> down*. But the node is there and we can log into it.
> Does anyone have an idea of what might be going on? And what we can do
> to get these nodes back into a usable (I guess "idle") state?
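
For reference, the periodic check described in the quoted message can be
sketched roughly as below. This is not the poster's actual script, just my
reading of the description. It assumes the node state is a string of
`+`-separated flags such as `IDLE+POWERING_DOWN`, in the style printed by
`scontrol show node` (the exact flag names may differ between Slurm
versions), and that `terminated_ok` comes from whatever cloud call
terminates the instance. The function names are made up.

```python
def power_down_command(node, state, terminated_ok):
    """Return the scontrol command for a down node that was terminated.

    Returns None when the node is not down or the termination failed.
    """
    base = state.split("+")[0].rstrip("*").upper()
    if base == "DOWN" and terminated_ok:
        return ["scontrol", "update", "NodeName=" + node, "State=POWER_DOWN"]
    return None


def resume_command(node, state):
    """Return the scontrol command to resume a powering-down, undrained node.

    Returns None when the node is drained/draining or not powering down.
    """
    flags = {flag.rstrip("*").upper() for flag in state.split("+")}
    drain_flags = {"DRAIN", "DRAINED", "DRAINING"}
    if "POWERING_DOWN" in flags and not (flags & drain_flags):
        return ["scontrol", "update", "NodeName=" + node, "State=RESUME"]
    return None
```

In the described script the second step runs about one second after the
first. Each returned list could be passed to `subprocess.run` on the
controller host.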