[slurm-users] Power Save: When is RESUME an invalid node state?
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Wed Dec 6 09:30:13 UTC 2023
Hi Xaver,
On 12/6/23 09:28, Xaver Stiensmeier wrote:
> using https://slurm.schedmd.com/power_save.html we had one case out of
> many (>242) node starts that resulted in
>
> |slurm_update error: Invalid node state specified|
>
> when we called:
>
> |scontrol update NodeName="$1" state=RESUME reason=FailedStartup|
>
> in the Fail script. We run this to make 100% sure that the instances -
> that are created on demand - are again `~idle` after being removed by the
> fail program. They are set to RESUME before the actual instance gets
> destroyed. I remember that I had this case manually before, but I don't
> remember when it occurs.
>
> Maybe someone has a great idea how to tackle this problem.

Probably you can't assign a "reason" when you update a node with
state=RESUME. The scontrol manual page says:

  Reason=<reason> Identify the reason the node is in a "DOWN", "DRAINED",
  "DRAINING", "FAILING" or "FAIL" state.

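
If the reason= argument is indeed what triggers the error (I have not verified this), a fail script could simply leave it out when resuming. Below is a minimal sketch of that idea, not a tested fix. The function name resume_node and the SCONTROL override variable are my own inventions for illustration. Only "scontrol update NodeName=... State=RESUME" and "scontrol show node" are real commands. On failure it prints the node record to stderr, which may help to find out in which state the error occurs.

```shell
#!/bin/bash
# Hypothetical helper for a ResumeFailProgram-style script.
# SCONTROL is an assumed override so the function can be dry-run without Slurm.
resume_node() {
    local node="$1"
    # Reason= is documented only for DOWN/DRAINED/DRAINING/FAILING/FAIL,
    # so it is deliberately omitted from the RESUME update.
    if ! "${SCONTROL:-scontrol}" update NodeName="$node" State=RESUME; then
        echo "resume failed for $node, current node record follows" >&2
        # Dump the node record so the offending state ends up in the log.
        "${SCONTROL:-scontrol}" show node "$node" >&2
        return 1
    fi
}
```

In the fail script this would be called as: resume_node "$1"
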
Maybe you will find some useful hints in my Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
and in my power saving tools at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save
IHTH,
Ole