[slurm-users] nodes going to down* and getting stuck in that state
bbenedetto at goodyear.com
Thu May 20 12:06:29 UTC 2021
We had a situation recently where a desktop was turned off for a week. When
we brought it back online (in a different part of the network with a different
IP), everything came up fine (slurmd and munge).
But it kept going into DOWN* for no apparent reason (nothing obviously
wrong with the daemons, and nothing in the logs).
As part of another issue, we ran "scontrol reconfigure" (and, as it turned
out, restarted slurmctld as well). THAT seems to have stopped it from going
to DOWN*. It switched to IDLE and stayed there.
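In case it helps, the sequence was roughly the following (the node name is
just an example; the explicit RESUME at the end is the usual next step if a
node is still stuck in DOWN* after the restart):

    # push the current slurm.conf out to the daemons
    scontrol reconfigure

    # restart the controller (under systemd; adjust for your init system)
    systemctl restart slurmctld

    # if a node is still stuck, clear its state by hand
    scontrol update NodeName=desktop01 State=RESUME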
Not that this necessarily has anything to do with your issue...
But it does sound similar.
--
- Bill
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
Bill Benedetto <bbenedetto at goodyear.com> The Goodyear Tire & Rubber Co.
I don't speak for Goodyear and they don't speak for me. We're both happy.
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
>>> Herc Silverstein writes:
Herc> We have a cluster (in Google GCP) which has a few partitions set up
Herc> to autoscale, but one partition that is set up not to autoscale. The
Herc> desired state is for all of the nodes in this non-autoscaled partition
Herc> (SuspendExcParts=gpu-t4-4x-ondemand) to continue running uninterrupted.
Herc> However, we are finding that nodes periodically end up in the down*
Herc> state and that we cannot get them back into a usable state. We are
Herc> using Slurm 19.05.7.
Herc>
Herc> We have a script that runs periodically, checks the state of the
Herc> nodes, and takes action based on that state. If a node is in a down
Herc> state, it gets terminated, and if it is successfully terminated its
Herc> state is set to power_down. After a short one-second pause, any nodes
Herc> that are in the POWERING_DOWN state and not drained are set to RESUME.
Herc>
Herc> Sometimes, after we start a node up and it is running slurmd, we
Herc> cannot get it back into a usable Slurm state even after manually
Herc> fiddling with its state. It seems to flip between idle* and down*,
Herc> but the node is there and we can log into it.
Herc>
Herc> Does anyone have an idea of what might be going on? And what we can do
Herc> to get these nodes back into a usable (I guess "idle") state?
Herc>
Herc> Thanks,
Herc>
Herc> Herc
Herc>
Herc>
Herc>
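For anyone who wants something concrete to poke at, here is a minimal
sketch of the kind of periodic check/recover loop Herc describes above.
This is a guess at the shape of such a script, not his actual code:
terminate_instance() is just a placeholder for the GCP termination call,
and the exact state strings that sinfo prints vary between Slurm versions,
so treat the matching as approximate.

#!/usr/bin/env python3
# Sketch only: a periodic sweep that terminates down nodes, marks them
# power_down, pauses briefly, then resumes powering-down, non-drained nodes.
import subprocess
import time


def node_states():
    # One "nodename state" pair per line; %T includes flags such as "*".
    out = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %T"],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(line.split() for line in out.splitlines() if line.strip())


def terminate_instance(node):
    # Placeholder: put the real GCP termination call (API or gcloud) here
    # and return True only when the instance was actually terminated.
    return False


def set_state(node, state):
    subprocess.run(
        ["scontrol", "update", f"NodeName={node}", f"State={state}"],
        check=True,
    )


def sweep():
    # Down nodes: terminate the VM, then mark the node power_down.
    for node, state in node_states().items():
        if state.lower().startswith("down"):
            if terminate_instance(node):
                set_state(node, "POWER_DOWN")

    time.sleep(1)  # the short pause mentioned in the description

    # Powering-down nodes that are not drained get set back to RESUME.
    for node, state in node_states().items():
        s = state.lower()
        if "powering_down" in s and "drain" not in s:
            set_state(node, "RESUME")


if __name__ == "__main__":
    sweep()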