[slurm-users] Power Save: When is RESUME an invalid node state?

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Wed Dec 6 09:30:13 UTC 2023


Hi Xaver,

On 12/6/23 09:28, Xaver Stiensmeier wrote:
> using https://slurm.schedmd.com/power_save.html we had one case out of 
> many (>242) node starts that resulted in
> 
> |slurm_update error: Invalid node state specified|
> 
> when we called:
> 
> |scontrol update NodeName="$1" state=RESUME reason=FailedStartup|
> 
> in the Fail script. We run this to make 100% sure that the instances - 
> that are created on demand - are again `~idle` after being removed by the 
> fail program. They are set to RESUME before the actual instance gets 
> destroyed. I remember that I had this case manually before, but I don't 
> remember when it occurs.
> 
> Maybe someone has a great idea how to tackle this problem.

Probably you can't assign a "reason" when you update a node with 
state=RESUME.  The scontrol manual page says:

Reason=<reason> Identify the reason the node is in a "DOWN", "DRAINED", 
"DRAINING", "FAILING" or "FAIL" state.
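If the "reason" really is what triggers the error, a fail script that leaves it out might look roughly like the sketch below. This is only an illustration, not a tested fix: the function name resume_nodes and the SCONTROL override variable are made up for the example, and it assumes that slurmctld passes the failed nodes to the script as a hostlist expression in "$1". When the update is still rejected, the sketch prints the node record so that the state the node was actually in can be seen in the logs.

```shell
#!/bin/sh
# Hypothetical ResumeFailProgram sketch: resume failed nodes WITHOUT a reason.
# SCONTROL can be overridden (an assumption of this example, handy for testing).
SCONTROL="${SCONTROL:-scontrol}"

resume_nodes() {
    # $1: hostlist expression of the failed nodes, e.g. "node[1-3]"
    # "scontrol show hostnames" expands it into one node name per line.
    for node in $("$SCONTROL" show hostnames "$1"); do
        # Note: no reason=... here, only the state change.
        if ! "$SCONTROL" update NodeName="$node" State=RESUME; then
            # Record what state the node was in when RESUME was rejected.
            echo "RESUME rejected for $node, current node record:" >&2
            "$SCONTROL" show node "$node" >&2
        fi
    done
}

# Only act when called with an argument, as slurmctld would do.
if [ "$#" -ge 1 ]; then
    resume_nodes "$1"
fi
```

The output of "scontrol show node" at the moment of failure should show whether the node was in a state from which RESUME is not accepted.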

Maybe you will find some useful hints in my Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
and in my power saving tools at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save

IHTH,
Ole



