<div dir="ltr">The SLURM controller and every compute node need to know the full membership of the cluster. If you add a node, or a node's IP address changes, all the nodes have to be told about it, which for me usually means restarting slurmd on the compute nodes. <div><br></div><div>I mention this because I get caught by it all the time: I add some nodes and, for whatever reason, miss restarting slurmd on one of the compute nodes.</div><div><br></div><div>Tim</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, May 19, 2021 at 9:17 PM Herc Silverstein &lt;<a href="mailto:herc.silverstein@schrodinger.com" target="_blank">herc.silverstein@schrodinger.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>
<br>
We have a cluster (on Google Cloud Platform) that has a few partitions set up to <br>
autoscale, but one partition is set up not to autoscale. The desired <br>
state is for all of the nodes in this non-autoscaled partition <br>
(SuspendExcParts=gpu-t4-4x-ondemand) to continue running uninterrupted. <br>
However, we are finding that nodes periodically end up in the down* <br>
state and that we cannot get them back into a usable state. This is <br>
using Slurm 19.05.7.<br>
<br>
We have a script that runs periodically, checks the state of the <br>
nodes, and takes action based on that state. If a node is in a down <br>
state, it is terminated and, if termination succeeds, its state <br>
is set to power_down. After a 1-second pause, nodes that are in the <br>
POWERING_DOWN state and not drained are <br>
set to RESUME.<br>
<br>
Sometimes, after we start up a node and it is running slurmd, we cannot <br>
get it back into a usable Slurm state, even after <br>
manually changing its state. It seems to alternate between idle* and <br>
down*. But the node is there and we can log into it.<br>
<br>
Does anyone have an idea of what might be going on? And what we can do <br>
to get these nodes back into a usable (I guess "idle") state?<br>
<br>
Thanks,<br>
<br>
Herc<br>
<br>
<br>
<br>
</blockquote></div>
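<br><div>A minimal sketch of how the nodes discussed in this thread could be spotted, offered as an illustration rather than as the script either poster uses. It assumes the input text comes from <code>sinfo -N -h -o "%N %T"</code> (one "name state" pair per line), and relies on sinfo marking a non-responding node with a trailing <code>*</code>, as in idle* and down*. The function names and the sample node names are hypothetical. The <code>scontrol update</code> commands are only built as strings, not executed, and would normally be run only after slurmd has been restarted on the affected nodes.</div>

```python
# Sketch only. Input is assumed to look like the output of:
#   sinfo -N -h -o "%N %T"
# i.e. one "nodename state" pair per line.

def unresponsive_nodes(sinfo_output):
    """Return sorted node names whose state ends in '*' (not responding)."""
    nodes = set()
    for line in sinfo_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        name, state = fields
        if state.endswith("*"):
            nodes.add(name)  # a set drops repeats from multiple partitions
    return sorted(nodes)


def resume_commands(nodes):
    """Build (but do not run) one scontrol command per node."""
    return [f"scontrol update NodeName={n} State=RESUME" for n in nodes]


if __name__ == "__main__":
    # Hypothetical sample; gpu-2 appears twice, as it would if it were
    # listed under two partitions.
    sample = "gpu-1 idle\ngpu-2 down*\ngpu-3 idle*\ngpu-2 down*\n"
    for cmd in resume_commands(unresponsive_nodes(sample)):
        print(cmd)
```

<div>Only the parsing is exercised here. Whether a flagged node recovers still depends on its slurmd, and per the reply above the slurmd on the other compute nodes, having been restarted with the current cluster configuration.</div>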