[slurm-users] Power save doesn't start nodes
Antony Cleave
antony.cleave at gmail.com
Wed Jul 18 01:47:37 MDT 2018
I've not seen the IDLE* issue before, but when my nodes got stuck I've
always been able to fix them with this:
[root@cloud01 ~]# scontrol update nodename=cloud01 state=down reason=stuck
[root@cloud01 ~]# scontrol update nodename=cloud01 state=idle
[root@cloud01 ~]# scontrol update nodename=cloud01 state=power_down
[root@cloud01 ~]# scontrol update nodename=cloud01 state=power_up
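
With several hundred nodes affected, the same sequence could be wrapped in a
small loop. This is a minimal sketch, not a tested fix for the IDLE* state:
the reset_node function and the SCONTROL override are made up for
illustration, and it assumes the four updates above are what a stuck node
needs.

```shell
#!/bin/sh
# Sketch only: reset_node is a hypothetical helper, not part of Slurm.
# It replays the four scontrol updates shown above for one node and
# stops at the first command that fails.
# Set SCONTROL=echo to print the arguments instead of running them.
reset_node() {
    node="$1"
    cmd="${SCONTROL:-scontrol}"
    "$cmd" update nodename="$node" state=down reason=stuck &&
    "$cmd" update nodename="$node" state=idle &&
    "$cmd" update nodename="$node" state=power_down &&
    "$cmd" update nodename="$node" state=power_up
}

# Reset every node named on the command line.
for node in "$@"; do
    reset_node "$node" || echo "reset failed for $node" >&2
done
```

Saved under some name of your choosing and run with node names as arguments
(for example nodef74), it applies the sequence to each node in turn. Running
it first with SCONTROL=echo shows what would be sent to the controller
without changing any node state.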
Antony
On 17 July 2018 at 18:13, Michael Gutteridge <michael.gutteridge at gmail.com>
wrote:
> Hi
>
> I'm running a cluster in a cloud provider and have run up against an odd
> problem with power save. I've got several hundred nodes that Slurm won't
> power up even though they appear idle and in the powered-down state. I
> suspect that they are in a "not-so-idle" state: `scontrol` for all of the
> nodes which aren't being powered up shows the state as
> "IDLE*+CLOUD+POWER". The asterisk is throwing me off here; that state
> doesn't appear to be documented in the scontrol manpage (I want to say I'd
> seen it discussed on the list, but Google searches haven't turned up much
> yet).
>
> The other nodes in the cluster are being powered up and down as we'd
> expect. It's just these nodes that Slurm doesn't power up. In fact, it
> appears that the controller doesn't even _try_ to power up the node: the
> logs (both for the controller with DebugFlags=Power and the power
> management script logs) don't indicate even an attempt to start a node when
> requested.
>
> I haven't figured out a way to reliably reset the nodes to "IDLE". Some
> relevant configs are:
>
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU
> SuspendProgram=/var/lib/slurm-llnl/suspend
> SuspendTime=300
> SuspendRate=10
> ResumeRate=10
> ResumeProgram=/var/lib/slurm-llnl/resume
> ResumeTimeout=300
> BatchStartTimeout=300
>
> A typical node is configured thus:
>
> NodeName=nodef74 NodeAddr=nodef74.fhcrc.org Feature=c5.2xlarge CPUs=4
> RealMemory=16384 Weight=40 State=CLOUD
>
> Thanks for your time; any advice or hints are greatly appreciated.
>
> Michael
>
>
>