[slurm-users] Power save doesn't start nodes
Michael Gutteridge
michael.gutteridge at gmail.com
Wed Jul 18 11:43:58 MDT 2018
John: thanks for the link. Curiously, sinfo documents the asterisk but
doesn't show it, while scontrol shows the asterisk but doesn't document
it... at least for the state my cluster is in.
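(For reference, I've been checking the state roughly like this, using
nodef74 as an example node:

  sinfo -n nodef74
  scontrol show node nodef74 | grep -i State

scontrol reports State=IDLE*+CLOUD+POWER for the stuck nodes, while sinfo
shows the state without the asterisk.)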
Antony: thanks for the steps. I tried them out, but there was no change. It
seems like they should do the trick, but the controller would never run the
resume or suspend script; the logs indicated "the nodes already up" or
something similar. I ran the resume script manually and that seemed to have
restored the node to service.
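(By "manually" I mean invoking the ResumeProgram by hand with the node name
as its argument, roughly:

  sudo /var/lib/slurm-llnl/resume nodef74

where nodef74 stands in for the stuck node; the script takes the same
hostlist argument that slurmctld would normally pass it.)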
However, it may have just been that one node (I had been working almost
exclusively with it and may have put it into a weird state). On the other
nodes it now seems sufficient to just set state=power_up with scontrol. My
controller may have been in a bad state, or perhaps I just happened to pick
a couple of bad apples to work with.
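(That is, the full form from Antony's steps below, filling in the node name
as appropriate:

  scontrol update nodename=<node> state=power_up

without needing to cycle through down/idle/power_down first.)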
Anyway: it seems to be working again. Thanks for the help and advice.
Michael
On Wed, Jul 18, 2018 at 1:07 AM John Hearns <hearnsj at googlemail.com> wrote:
> If it is any help, https://slurm.schedmd.com/sinfo.html
> NODE STATE CODES
>
> Node state codes are shortened as required for the field size. These node
> states may be followed by a special character to identify state flags
> associated with the node. The following node suffixes and states are used:
> *   The node is presently not responding and will not be allocated any
> new work. If the node remains non-responsive, it will be placed in the
> DOWN state (except in the case of COMPLETING, DRAINED, DRAINING,
> FAIL, FAILING nodes).
>
> On 18 July 2018 at 09:47, Antony Cleave <antony.cleave at gmail.com> wrote:
>
>> I've not seen the IDLE* issue before but when my nodes got stuck I've
>> always been able to fix them with this:
>>
>> [root@cloud01 ~]# scontrol update nodename=cloud01 state=down reason=stuck
>> [root@cloud01 ~]# scontrol update nodename=cloud01 state=idle
>> [root@cloud01 ~]# scontrol update nodename=cloud01 state=power_down
>> [root@cloud01 ~]# scontrol update nodename=cloud01 state=power_up
>>
>> Antony
>>
>> On 17 July 2018 at 18:13, Michael Gutteridge <
>> michael.gutteridge at gmail.com> wrote:
>>
>>> Hi
>>>
>>> I'm running a cluster in a cloud provider and have run up against an odd
>>> problem with power save. I've got several hundred nodes that Slurm won't
>>> power up even though they appear idle and in the powered-down state. I
>>> suspect that they are in a "not-so-idle" state: `scontrol` for all of the
>>> nodes which aren't being powered up shows the state as
>>> "IDLE*+CLOUD+POWER". The asterisk is throwing me off here- that state
>>> doesn't appear to be documented in the scontrol manpage (I want to say I'd
>>> seen it discussed on the list, but google searches haven't turned up much
>>> yet).
>>>
>>> The other nodes in the cluster are being powered up and down as we'd
>>> expect. It's just these nodes that Slurm doesn't power up. In fact, it
>>> appears that the controller doesn't even _try_ to power up the node: the
>>> logs (both for the controller with DebugFlags=Power and the power
>>> management script logs) don't indicate even an attempt to start a node when
>>> requested.
>>>
>>> I haven't figured out a way to reliably reset the nodes to "IDLE". Some
>>> relevant configs are:
>>>
>>> SchedulerType=sched/backfill
>>> SelectType=select/cons_res
>>> SelectTypeParameters=CR_CPU
>>> SuspendProgram=/var/lib/slurm-llnl/suspend
>>> SuspendTime=300
>>> SuspendRate=10
>>> ResumeRate=10
>>> ResumeProgram=/var/lib/slurm-llnl/resume
>>> ResumeTimeout=300
>>> BatchStartTimeout=300
>>>
>>> A typical node is configured thus:
>>>
>>> NodeName=nodef74 NodeAddr=nodef74.fhcrc.org Feature=c5.2xlarge CPUs=4
>>> RealMemory=16384 Weight=40 State=CLOUD
>>>
>>> Thanks for your time- any advice or hints are greatly appreciated.
>>>
>>> Michael
>>>