[slurm-users] Power Save: When is RESUME an invalid node state?
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Wed Dec 6 10:09:54 UTC 2023
Hi Xaver,
Your version of Slurm may matter for your power saving experience. Do you
run an updated version?
/Ole
On 12/6/23 10:54, Xaver Stiensmeier wrote:
> Hi Ole,
>
> I will double check, but I am very sure that giving a reason is possible,
> as it has been done at least 20 other times without error during that
> exact run. It might be ignored, though. You can also give a reason when
> defining the states POWER_UP and POWER_DOWN. Slurm's documentation does
> not always give all the information. We have been running our solution
> for about a year now, so I don't think there's a general problem (as in
> something that necessarily occurs) with the command. But I will take a
> closer look. I really feel like it has to be something more conditional,
> though, as otherwise the error would've occurred more often (i.e. every
> time a failure is handled and the command is executed).
>
>> IHTH,
>> Ole
>>
>>
>
--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: Ole.H.Nielsen at fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620
> Your repository would've been really helpful for me when we started
> implementing the cloud scheduling, but I feel like we have already
> implemented most of the things you mention there. But I will take a look
> at `DebugFlags=Power`. `PrivateData=cloud` was an annoying thing to find
> out; Slurm plans/planned to change that in the future (the cloud key
> behaves differently from any other key in PrivateData). Of course our
> setup differs a little in the details.
>
> Best regards
> Xaver
>
> On 06.12.23 10:30, Ole Holm Nielsen wrote:
>> Hi Xaver,
>>
>> On 12/6/23 09:28, Xaver Stiensmeier wrote:
>>> using https://slurm.schedmd.com/power_save.html we had one case out
>>> of many (>242) node starts that resulted in
>>>
>>> |slurm_update error: Invalid node state specified|
>>>
>>> when we called:
>>>
>>> |scontrol update NodeName="$1" state=RESUME reason=FailedStartup|
>>>
>>> in the Fail script. We run this to make 100% sure that the instances
>>> - that are created on demand - are again `~idle` after being removed
>>> by the fail program. They are set to RESUME before the actual
>>> instance gets destroyed. I remember that I had this case manually
>>> before, but I don't remember when it occurs.
>>>
>>> Maybe someone has a great idea how to tackle this problem.
>>
>> Probably you can't assign a "reason" when you update a node with
>> state=RESUME. The scontrol manual page says:
>>
>> Reason=<reason> Identify the reason the node is in a "DOWN",
>> "DRAINED", "DRAINING", "FAILING" or "FAIL" state.
>>
>> Maybe you will find some useful hints in my Wiki page
>> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
>>
>> and in my power saving tools at
>> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save
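
Since the error showed up only once in many node starts, it may help to have the Fail script record the node state whenever the RESUME update is rejected. The following is only a minimal sketch, not the script used in this thread: it assumes the Fail script is the one configured as ResumeFailProgram and receives the affected nodes as a hostlist expression in $1, and the names node_state, resume_node and the SCONTROL override are hypothetical helpers introduced here for illustration. The scontrol update command itself is kept exactly as quoted above.

```shell
#!/usr/bin/env bash
# Sketch: log the node state when "scontrol update ... state=RESUME" fails.
# Assumption: called by Slurm with a hostlist expression as the first argument.
set -u

# Hypothetical override so the script can be dry-run without a cluster.
SCONTROL="${SCONTROL:-scontrol}"

# Print the State= field reported by "scontrol show node <name>".
node_state() {
    "$SCONTROL" show node "$1" | tr ' ' '\n' | sed -n 's/^State=//p' | head -n 1
}

# Try to resume one node; on failure, report the state seen just before.
resume_node() {
    local node="$1" before
    before="$(node_state "$node")"
    if ! "$SCONTROL" update NodeName="$node" state=RESUME reason=FailedStartup; then
        echo "$(date '+%Y-%m-%dT%H:%M:%S') RESUME failed for $node (state before: ${before:-unknown})" >&2
        return 1
    fi
}

# Expand the hostlist and handle each node; keep going if one node fails.
if [ "$#" -gt 0 ]; then
    for node in $("$SCONTROL" show hostnames "$1"); do
        resume_node "$node" || true
    done
fi
```

The logged state (for example whether the node was already idle or powered down at that moment) should show whether the rejection depends on the state the node happens to be in, which would fit the suspicion that the problem is conditional rather than general.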