[slurm-users] Power save doesn't start nodes

John Hearns hearnsj at googlemail.com
Wed Jul 18 02:04:23 MDT 2018


If it is any help,  https://slurm.schedmd.com/sinfo.html
NODE STATE CODES

Node state codes are shortened as required for the field size. These node
states may be followed by a special character to identify state flags
associated with the node. The following node suffixes and states are used:
* The node is presently not responding and will not be allocated any new
work. If the node remains non-responsive, it will be placed in the *DOWN*
state (except in the case of *COMPLETING*, *DRAINED*, *DRAINING*, *FAIL*,
*FAILING* nodes).
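As an illustration (this snippet is not from the sinfo docs), the trailing flag can be split off a state string like "IDLE*" with plain shell parameter expansion:

```shell
# Split a Slurm state string into its base state and trailing flag
# character (e.g. "*" = not responding, "~" = powered down).
state="IDLE*"
base=${state%[*~#%]}    # strip one trailing flag character, if present
flag=${state#"$base"}   # whatever remains is the flag
echo "$base $flag"
```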

On 18 July 2018 at 09:47, Antony Cleave <antony.cleave at gmail.com> wrote:

> I've not seen the IDLE* issue before, but when my nodes got stuck I've
> always been able to fix them with this:
>
> [root at cloud01 ~]# scontrol update nodename=cloud01 state=down reason=stuck
> [root at cloud01 ~]# scontrol update nodename=cloud01 state=idle
> [root at cloud01 ~]# scontrol update nodename=cloud01 state=power_down
> [root at cloud01 ~]# scontrol update nodename=cloud01 state=power_up
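That four-command reset could be wrapped in a small helper when many nodes are stuck; a sketch (the `reset_stuck_node` name and the `SCONTROL` override are assumptions for illustration, not from the thread):

```shell
#!/bin/sh
# Hypothetical helper: cycle one stuck node through the sequence above.
# Override SCONTROL (e.g. SCONTROL=echo) for a dry run outside a cluster.
SCONTROL=${SCONTROL:-scontrol}

reset_stuck_node() {
    node=$1
    "$SCONTROL" update nodename="$node" state=down reason=stuck
    "$SCONTROL" update nodename="$node" state=idle
    "$SCONTROL" update nodename="$node" state=power_down
    "$SCONTROL" update nodename="$node" state=power_up
}
```

Usage would then be `for n in cloud01 cloud02; do reset_stuck_node "$n"; done`.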
>
> Antony
>
> On 17 July 2018 at 18:13, Michael Gutteridge <michael.gutteridge at gmail.com> wrote:
>
>> Hi
>>
>> I'm running a cluster in a cloud provider and have run up against an odd
>> problem with power save.  I've got several hundred nodes that Slurm won't
>> power up even though they appear idle and in the powered-down state.  I
>> suspect that they are in a "not-so-idle" state: `scontrol` for all of the
>> nodes which aren't being powered up shows the state as
>> "IDLE*+CLOUD+POWER".  The asterisk is throwing me off here- that state
>> doesn't appear to be documented in the scontrol manpage (I want to say I'd
>> seen it discussed on the list, but google searches haven't turned up much
>> yet).
>>
>> The other nodes in the cluster are being powered up and down as we'd
>> expect.  It's just these nodes that Slurm doesn't power up.  In fact, it
>> appears that the controller doesn't even _try_ to power up the node- the
>> logs (both for the controller with DebugFlags=Power and the power
>> management script logs) don't indicate even an attempt to start a node when
>> requested.
>>
>> I haven't figured out a way to reliably reset the nodes to "IDLE".  Some
>> relevant configs are:
>>
>> SchedulerType=sched/backfill
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU
>> SuspendProgram=/var/lib/slurm-llnl/suspend
>> SuspendTime=300
>> SuspendRate=10
>> ResumeRate=10
>> ResumeProgram=/var/lib/slurm-llnl/resume
>> ResumeTimeout=300
>> BatchStartTimeout=300
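For reference, slurmctld invokes a ResumeProgram like the one configured above with a hostlist expression (e.g. "nodef[74-76]") as its argument; a minimal skeleton might look like this (the `resume_nodes` helper and `SCONTROL` override are hypothetical, and the actual instance-start call is site-specific and stubbed here):

```shell
#!/bin/sh
# Hypothetical ResumeProgram skeleton. Slurm passes a hostlist
# expression as $1; "scontrol show hostnames" expands it to one
# hostname per line. Override SCONTROL for testing outside a cluster.
SCONTROL=${SCONTROL:-scontrol}

resume_nodes() {
    for host in $("$SCONTROL" show hostnames "$1"); do
        # Site-specific: start the cloud instance backing $host here.
        echo "resume: $host"
    done
}
```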
>>
>> A typical node is configured thus:
>>
>> NodeName=nodef74 NodeAddr=nodef74.fhcrc.org Feature=c5.2xlarge CPUs=4
>> RealMemory=16384 Weight=40 State=CLOUD
>>
>> Thanks for your time- any advice or hints are greatly appreciated.
>>
>> Michael
>>
>>
>>
>