[slurm-users] nodes going to down* and getting stuck in that state
bbenedetto at goodyear.com
Thu May 20 12:06:29 UTC 2021
We had a situation recently where a desktop was turned off for a week. When
we brought it back online (in a different part of the network with a different
IP), everything came up fine (slurmd and munge).
But it kept going into DOWN* for no apparent reason (nothing obviously
wrong with the daemons, and nothing in the logs).
As part of another issue, we ran "scontrol reconfigure" (and, as it turned
out, restarted slurmctld as well). THAT seems to have stopped it from going
to DOWN*. It switched to IDLE and stayed there.
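In case it helps, the sequence was roughly the following (the node name is
just an example; the explicit RESUME at the end is the usual next step if a
node is still stuck in DOWN* after the restart):

    # push the current slurm.conf out to the daemons
    scontrol reconfigure

    # restart the controller (under systemd; adjust for your init system)
    systemctl restart slurmctld

    # if a node is still stuck, clear its state by hand
    scontrol update NodeName=desktop01 State=RESUME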
Not that this necessarily has anything to do with your issue...
But it does sound similar.
--
- Bill
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
Bill Benedetto <bbenedetto at goodyear.com> The Goodyear Tire & Rubber Co.
I don't speak for Goodyear and they don't speak for me. We're both happy.
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
>>> Herc Silverstein writes:
Herc> We have a cluster (in Google GCP) which has a few partitions set up
Herc> to autoscale, but one partition that is set up not to autoscale. The
Herc> desired state is for all of the nodes in this non-autoscaled partition
Herc> (SuspendExcParts=gpu-t4-4x-ondemand) to continue running uninterrupted.
Herc> However, we are finding that nodes periodically end up in the down*
Herc> state and that we cannot get them back into a usable state. We are
Herc> using Slurm 19.05.7.
Herc>
Herc> We have a script that runs periodically, checks the state of the
Herc> nodes, and takes action based on that state. If a node is in a down
Herc> state, it gets terminated, and if it is successfully terminated its
Herc> state is set to power_down. After a short one-second pause, any nodes
Herc> that are in the POWERING_DOWN state and not drained are set to RESUME.
Herc>
Herc> Sometimes, after we start a node up and it is running slurmd, we
Herc> cannot get it back into a usable Slurm state even after manually
Herc> fiddling with its state. It seems to flip between idle* and down*,
Herc> but the node is there and we can log into it.
Herc>
Herc> Does anyone have an idea of what might be going on? And what we can do
Herc> to get these nodes back into a usable (I guess "idle") state?
Herc>
Herc> Thanks,
Herc>
Herc> Herc
Herc>
Herc>
Herc>
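For anyone who wants something concrete to poke at, here is a minimal
sketch of the kind of periodic check/recover loop Herc describes above.
This is a guess at the shape of such a script, not his actual code:
terminate_instance() is just a placeholder for the GCP termination call,
and the exact state strings that sinfo prints vary between Slurm versions,
so treat the matching as approximate.

#!/usr/bin/env python3
# Sketch only: a periodic sweep that terminates down nodes, marks them
# power_down, pauses briefly, then resumes powering-down, non-drained nodes.
import subprocess
import time


def node_states():
    # One "nodename state" pair per line; %T includes flags such as "*".
    out = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %T"],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(line.split() for line in out.splitlines() if line.strip())


def terminate_instance(node):
    # Placeholder: put the real GCP termination call (API or gcloud) here
    # and return True only when the instance was actually terminated.
    return False


def set_state(node, state):
    subprocess.run(
        ["scontrol", "update", f"NodeName={node}", f"State={state}"],
        check=True,
    )


def sweep():
    # Down nodes: terminate the VM, then mark the node power_down.
    for node, state in node_states().items():
        if state.lower().startswith("down"):
            if terminate_instance(node):
                set_state(node, "POWER_DOWN")

    time.sleep(1)  # the short pause mentioned in the description

    # Powering-down nodes that are not drained get set back to RESUME.
    for node, state in node_states().items():
        s = state.lower()
        if "powering_down" in s and "drain" not in s:
            set_state(node, "RESUME")


if __name__ == "__main__":
    sweep()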