[slurm-users] Power Save: When is RESUME an invalid node state?

Xaver Stiensmeier xaverstiensmeier at gmx.de
Wed Dec 6 09:54:03 UTC 2023


Hi Ole,

I will double-check, but I am quite sure that giving a reason is possible,
as it has been done at least 20 other times without error during that
exact run. It might be ignored, though. You can also give a reason when
setting the states POWER_UP and POWER_DOWN. Slurm's documentation does
not always give complete information. We have been running our solution
for about a year now, so I don't think there is a general problem (as in
something that necessarily occurs) with the command. But I will take a
closer look. I really feel it has to be something more conditional,
though, as otherwise the error would have occurred more often (i.e. every
time a failure is handled and the command is executed).
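
To narrow down when RESUME is rejected, the fail script could record the
node state just before the update. This is only a minimal sketch, assuming
the script receives the node name as `$1`, as in the command quoted
below. The `resume_node` function and the `SCONTROL` override variable
are hypothetical names for illustration, not part of Slurm:

```shell
#!/bin/bash
# Sketch: log the node state whenever Slurm rejects the RESUME update.
# SCONTROL is a hypothetical override used for testing; it defaults to
# the real scontrol binary.
SCONTROL="${SCONTROL:-scontrol}"

resume_node() {
    local node="$1"
    local state
    # Capture the State=... field before the update so that a rejected
    # RESUME can be correlated with the state the node was in.
    state="$("$SCONTROL" show node "$node" | grep -ow 'State=[^ ]*')"
    if ! "$SCONTROL" update NodeName="$node" state=RESUME reason=FailedStartup; then
        echo "RESUME rejected for $node (was: ${state:-unknown})" >&2
        return 1
    fi
}

# Only act when called with a node name, e.g. by the fail program hook.
if [ "$#" -gt 0 ]; then
    resume_node "$1"
fi
```

The logged state should show whether the failing case always coincides
with one particular node state.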

Your repository would have been really helpful for me when we started
implementing the cloud scheduling, but I feel like we have already
implemented most of the things you mention there. I will take a look at
`DebugFlags=Power`, though. `PrivateData=cloud` was an annoying thing to
find out; SLURM plans/planned to change that in the future (the cloud
key behaves differently from any other key in PrivateData). Of course,
our setup differs a little in the details.
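
For reference, a minimal sketch of how these two settings might appear
in slurm.conf. It is a fragment only and would have to be merged with
any existing `DebugFlags` and `PrivateData` values. As far as I
understand the slurm.conf documentation, the other PrivateData keys hide
information from regular users, whereas `cloud` makes powered-down cloud
nodes visible:

```
# slurm.conf fragment (sketch, not a complete configuration)

# Log power save activity, including suspend/resume program calls.
DebugFlags=Power

# Keep powered-down cloud nodes visible in sinfo/scontrol output.
PrivateData=cloud
```

The debug flag can presumably also be toggled at runtime with
`scontrol setdebugflags +Power`, which avoids editing the configuration
just for a debugging session.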

Best regards
Xaver

On 06.12.23 10:30, Ole Holm Nielsen wrote:
> Hi Xavier,
>
> On 12/6/23 09:28, Xaver Stiensmeier wrote:
>> using https://slurm.schedmd.com/power_save.html we had one case out
>> of many (>242) node starts that resulted in
>>
>> `slurm_update error: Invalid node state specified`
>>
>> when we called:
>>
>> `scontrol update NodeName="$1" state=RESUME reason=FailedStartup`
>>
>> in the Fail script. We run this to make 100% sure that the instances
>> - that are created on demand - are again `~idle` after being removed
>> by the fail program. They are set to RESUME before the actual
>> instance gets destroyed. I remember that I had this case manually
>> before, but I don't remember when it occurs.
>>
>> Maybe someone has a great idea how to tackle this problem.
>
> Probably you can't assign a "reason" when you update a node with
> state=RESUME.  The scontrol manual page says:
>
> Reason=<reason> Identify the reason the node is in a "DOWN",
> "DRAINED", "DRAINING", "FAILING" or "FAIL" state.
>
> Maybe you will find some useful hints in my Wiki page
> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
>
> and in my power saving tools at
> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save
>
> IHTH,
> Ole
>
>


