<div dir="ltr"><h2><font size="2"><span style="font-weight:normal">If it is any help, <a href="https://slurm.schedmd.com/sinfo.html">https://slurm.schedmd.com/sinfo.html</a><br></span></font></h2><h2><font size="2"><span style="font-weight:normal">NODE STATE CODES</span></font></h2>
<p>
Node state codes are shortened as required for the field size.
These node states may be followed by a special character to identify
state flags associated with the node.
The following node suffixes and states are used:
</p><dl compact>
<a id="gmail-OPT_*"></a>
<dt><b>*</b></dt><dd>
The node is presently not responding and will not be allocated
any new work. If the node remains non-responsive, it will
be placed in the <b>DOWN</b> state (except in the case of
<b>COMPLETING</b>, <b>DRAINED</b>, <b>DRAINING</b>,
<b>FAIL</b>, <b>FAILING</b> nodes).
</dd></dl><br></div><div class="gmail_extra"><br><div class="gmail_quote">On 18 July 2018 at 09:47, Antony Cleave <span dir="ltr">&lt;<a href="mailto:antony.cleave@gmail.com" target="_blank">antony.cleave@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I've not seen the IDLE* issue before, but when my nodes got stuck I've always been able to fix them with this:<div><br></div><div><div>[root@cloud01 ~]# scontrol update nodename=cloud01 state=down reason=stuck<br></div><div>[root@cloud01 ~]# scontrol update nodename=cloud01 state=idle</div><div>[root@cloud01 ~]# scontrol update nodename=cloud01 state=power_down</div><div>[root@cloud01 ~]# scontrol update nodename=cloud01 state=power_up</div></div><span class="HOEnZb"><font color="#888888"><div><br></div><div>Antony</div></font></span></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On 17 July 2018 at 18:13, Michael Gutteridge <span dir="ltr">&lt;<a href="mailto:michael.gutteridge@gmail.com" target="_blank">michael.gutteridge@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_default" style="font-family:monospace">Hi</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">I'm running a cluster in a cloud provider and have run up against an odd problem with power save. I've got several hundred nodes that Slurm won't power up even though they appear idle and in the powered-down state. I suspect that they are in a "not-so-idle" state: `scontrol` for all of the nodes which aren't being powered up shows the state as "IDLE*+CLOUD+POWER". 
The asterisk is throwing me off here- that state doesn't appear to be documented in the scontrol manpage (I want to say I'd seen it discussed on the list, but google searches haven't turned up much yet).</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">The other nodes in the cluster are being powered up and down as we'd expect. It's just these nodes that Slurm doesn't power up. In fact, it appears that the controller doesn't even _try_ to power up the node- the logs (both for the controller with DebugFlags=Power and the power management script logs) don't indicate even an attempt to start a node when requested.</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">I haven't figured a way to reliably reset the nodes to "IDLE". Some relevant configs are:</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace"><div class="gmail_default">SchedulerType=sched/backfill</div><div class="gmail_default">SelectType=select/cons_res<br></div><div class="gmail_default">SelectTypeParameters=CR_CPU</div><div>SuspendProgram=/var/lib/slurm-<wbr>llnl/suspend<br></div></div><div class="gmail_default" style="font-family:monospace"><div class="gmail_default">SuspendTime=300</div><div class="gmail_default">SuspendRate=10</div><div class="gmail_default">ResumeRate=10</div><div class="gmail_default">ResumeProgram=/var/lib/slurm-l<wbr>lnl/resume</div><div class="gmail_default">ResumeTimeout=300</div><div class="gmail_default">BatchStartTimeout=300</div><div><br></div></div><div class="gmail_default" style="font-family:monospace">A typical node is configured thus:</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace"><div class="gmail_default">NodeName=nodef74 NodeAddr=<a 
href="http://nodef74.fhcrc.org" target="_blank">nodef74.fhcrc.org</a> Feature=c5.2xlarge CPUs=4 RealMemory=16384 Weight=40 State=CLOUD</div><div><br></div><div>Thanks for your time- any advice or hints are greatly appreciated.</div><span class="m_-2333848826916984639HOEnZb"><font color="#888888"><div><br></div><div>Michael</div><div><br></div><div><br></div></font></span></div></div>
</blockquote></div><br></div>
</div></div></blockquote></div><br></div>
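<p>For anyone wanting to find every node carrying the <b>*</b> flag described in the man page excerpt, here is a rough sketch rather than a tested tool. It assumes the one-line-per-node output of <code>scontrol show node -o</code> contains <code>NodeName=</code> and <code>State=</code> fields, with states looking like the "IDLE*+CLOUD+POWER" reported in this thread. The function name <code>nonresponding_nodes</code> is made up for illustration.</p>

```shell
# Print the names of nodes whose State field contains "*" (not responding).
# Reads one node record per line on stdin; the record layout is an assumption
# based on what `scontrol show node -o` is expected to print.
nonresponding_nodes() {
    awk '{
        name = ""; state = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^NodeName=/) name  = substr($i, 10)  # text after "NodeName="
            if ($i ~ /^State=/)    state = substr($i, 7)   # text after "State="
        }
        if (name != "" && index(state, "*") > 0) print name
    }'
}

# Intended use on a live cluster (not run here):
#   scontrol show node -o | nonresponding_nodes
```

<p>The resulting names could then be fed one at a time to the <code>scontrol update nodename=... state=...</code> sequence Antony posted, though whether that clears the flag on cloud nodes would need checking on a single node first.</p>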