[slurm-users] Power save doesn't start nodes

Michael Gutteridge michael.gutteridge at gmail.com
Tue Jul 17 11:13:54 MDT 2018


Hi

I'm running a cluster in a cloud provider and have run up against an odd
problem with power save.  I've got several hundred nodes that Slurm won't
power up even though they appear idle and in the powered-down state.  I
suspect that they are in a "not-so-idle" state: `scontrol` for all of the
nodes which aren't being powered up shows the state as
"IDLE*+CLOUD+POWER".  The asterisk is throwing me off here: that state
doesn't appear to be documented in the scontrol manpage (I want to say I'd
seen it discussed on the list, but Google searches haven't turned up much
yet).
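
To pick out the affected nodes, here is a minimal sketch that splits a state
string like the one above into its base state, the asterisk marker, and the
"+" flags.  It assumes the BASE[*]+FLAG+... layout seen in the `scontrol`
output quoted here.  `parse_state`, `starred_nodes`, and the node name
`nodef75` are hypothetical names used only for illustration, and the state
strings would have to be collected from `scontrol` separately.  The sinfo
manpage lists "*" as a suffix for nodes that are not responding; that
`scontrol` means the same thing here is an assumption.

```python
from typing import Dict, List, NamedTuple, Tuple


class NodeState(NamedTuple):
    base: str                # e.g. "IDLE"
    flags: Tuple[str, ...]   # e.g. ("CLOUD", "POWER")
    starred: bool            # True if the base state carried a trailing "*"


def parse_state(state: str) -> NodeState:
    """Split a state string such as "IDLE*+CLOUD+POWER" into its parts.

    Assumes the layout BASE[*]+FLAG+FLAG as it appears in the output above.
    """
    parts = state.strip().split("+")
    base = parts[0]
    starred = base.endswith("*")
    return NodeState(base.rstrip("*"), tuple(parts[1:]), starred)


def starred_nodes(states: Dict[str, str]) -> List[str]:
    """Return the names of nodes whose state carries the asterisk, sorted."""
    return sorted(name for name, s in states.items() if parse_state(s).starred)


if __name__ == "__main__":
    # nodef75 is a made-up node, included only for contrast.
    sample = {
        "nodef74": "IDLE*+CLOUD+POWER",
        "nodef75": "IDLE+CLOUD+POWER",
    }
    print(starred_nodes(sample))
```

Feeding it a mapping of node name to state string returns only the nodes
whose state includes the asterisk, which should match the set that is never
powered up if the asterisk is indeed the distinguishing factor.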

The other nodes in the cluster are being powered up and down as we'd
expect.  It's just these nodes that Slurm doesn't power up.  In fact, it
appears that the controller doesn't even _try_ to power up the node: the
logs (both the controller log with DebugFlags=Power and the power
management script logs) show no attempt to start a node when one is
requested.

I haven't figured out a way to reliably reset the nodes to "IDLE".  Some
relevant configs are:

SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
SuspendProgram=/var/lib/slurm-llnl/suspend
SuspendTime=300
SuspendRate=10
ResumeRate=10
ResumeProgram=/var/lib/slurm-llnl/resume
ResumeTimeout=300
BatchStartTimeout=300

A typical node is configured thus:

NodeName=nodef74 NodeAddr=nodef74.fhcrc.org Feature=c5.2xlarge CPUs=4 RealMemory=16384 Weight=40 State=CLOUD

Thanks for your time; any advice or hints are greatly appreciated.

Michael