<div dir="ltr">The SLURM controller AND all the compute nodes need to know which nodes are in the cluster. If you add a node, or a node changes its IP address, you need to let all the nodes know about it, which for me usually means restarting slurmd on the compute nodes.<div><br></div><div>I mention this because I get caught by it all the time: I add some nodes and, for whatever reason, miss restarting one of the slurmd processes on the compute nodes.</div><div><br></div><div>Tim</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, May 19, 2021 at 9:17 PM Herc Silverstein &lt;<a href="mailto:herc.silverstein@schrodinger.com" target="_blank">herc.silverstein@schrodinger.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>

<br>

We have a cluster (on Google Cloud) which has a few partitions set up to <br>

auto-scale, but one partition is set up to not autoscale. The desired <br>

state is for all of the nodes in this non-autoscaled partition <br>

(SuspendExcParts=gpu-t4-4x-ondemand) to continue running uninterrupted.  <br>

However, we are finding that nodes periodically end up in the down* <br>

state and that we cannot get them back into a usable state.  This is <br>

with Slurm 19.05.7.<br>

<br>

We have a script that runs periodically and checks the state of the <br>

nodes and takes action based on the state.  If a node is in a down <br>

state, it gets terminated and, if the termination succeeds, its state <br>

is set to power_down.  After a short 1-second pause, any nodes that <br>

are in the POWERING_DOWN state and not drained are <br>

set to RESUME.<br>

<br>

Sometimes, after we start a node up and it is running slurmd, we cannot <br>

get it back into a usable Slurm state, even after <br>

manually changing its state.  It seems to alternate between idle* and <br>

down*, even though the node is up and we can log into it.<br>

<br>

Does anyone have an idea of what might be going on?  And what we can do <br>

to get these nodes back into a usable (I guess "idle") state?<br>

<br>

Thanks,<br>

<br>

Herc<br>

<br>

<br>

<br>

</blockquote></div>
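<div><br></div><div>A minimal sketch of how the check discussed in this thread might look, not a tested recipe. It assumes node states are read from <code>sinfo -N -h -o "%N %T"</code>, where a trailing <code>*</code> on the state marks a node the controller considers not responding. The helper names (<code>non_responding</code>, <code>suggested_commands</code>, <code>query_sinfo</code>) are hypothetical, and the use of <code>ssh</code>, <code>sudo</code> and <code>systemctl</code> to restart slurmd is an assumption about the site setup.</div>
<pre>
```python
import subprocess


def non_responding(sinfo_text):
    """Return sorted names of nodes whose state carries the '*' flag."""
    nodes = set()
    for line in sinfo_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, state = parts
        if "*" in state:  # e.g. "down*" or "idle*"
            nodes.add(name)  # a set, since -N lists a node once per partition
    return sorted(nodes)


def suggested_commands(node):
    """Commands an admin might run for one node (assumed setup, see lead-in)."""
    return [
        # Assumption: slurmd is managed by systemd and reachable over ssh.
        ["ssh", node, "sudo", "systemctl", "restart", "slurmd"],
        ["scontrol", "update", "NodeName=" + node, "State=RESUME"],
    ]


def query_sinfo():
    """Ask Slurm for one 'name state' pair per line, without a header."""
    result = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %T"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```
</pre>
<div>For a dry run, calling <code>non_responding(query_sinfo())</code> on a host with the Slurm client tools and printing the result of <code>suggested_commands</code> for each node shows what would be run without changing anything.</div>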