[slurm-users] [slurm 20.02.3] don't suspend nodes in down state

Angelos Ching angelosching at clustertech.com
Mon Aug 24 10:24:17 UTC 2020


In my SuspendProgram and its helper programs, I have some logic that makes sure the node to be acted on is in the idle state before any power action is performed.
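
A minimal sketch of that kind of check, not my actual scripts. It assumes Slurm passes the node list as a hostlist expression in the first argument, that `scontrol show hostnames` and `sinfo -h -N -n <node> -o %T` are available on the controller, and that sinfo may append flag characters such as `~` or `*` to the state name. `power_off()` is a hypothetical placeholder for whatever performs the real power action (IPMI, cloud API, etc.).

```python
#!/usr/bin/env python3
"""Sketch of a SuspendProgram wrapper that only powers off idle nodes."""
import subprocess
import sys

# Flag characters sinfo may append to a state name, e.g. "idle~" or "down*".
# Assumption: this set covers the suffixes used by the installed Slurm version.
STATE_FLAG_CHARS = "*~#!%$@^-"


def base_state(raw_state):
    """Return the lower-cased state name with trailing flag characters removed."""
    return raw_state.strip().rstrip(STATE_FLAG_CHARS).lower()


def should_suspend(raw_state):
    """Only a node whose base state is idle may be powered off."""
    return base_state(raw_state) == "idle"


def expand_hostlist(hostlist):
    """Expand e.g. 'node[01-03]' into individual host names via scontrol."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", hostlist],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def node_states(node):
    """Return the states sinfo reports for a node (one entry per partition)."""
    result = subprocess.run(
        ["sinfo", "-h", "-N", "-n", node, "-o", "%T"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def power_off(node):
    """Hypothetical placeholder: replace with the site's real power action."""
    print(f"would power off {node}")


def main(argv):
    if len(argv) != 2:
        print("usage: suspend_wrapper.py <hostlist>", file=sys.stderr)
        return 1
    for node in expand_hostlist(argv[1]):
        states = node_states(node)
        # Be conservative: skip the node unless every reported state is idle.
        if states and all(should_suspend(s) for s in states):
            power_off(node)
        else:
            print(f"skipping {node}: state {states}", file=sys.stderr)
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

With this approach a node in the down state is simply skipped by the script, so no change to slurmctld or slurm.conf is needed. Slurm itself may still mark the node as powered down, though.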

Best regards,
Angelos
(Sent from mobile, please pardon me for typos and cursoriness.)

> On 2020/08/24 at 17:42, Jacek Budzowski <j.budzowski at cyfronet.pl> wrote:
> 
> 
> Dear Herbert,
> 
> In our installation we also had this problem.
> Unfortunately, we didn't find a more elegant solution than changing the Slurm code (and recompiling slurmctld).
> Here is the patch we use to prevent DOWN nodes from being suspended:
> 
> diff --git a/src/slurmctld/power_save.c b/src/slurmctld/power_save.c
> index 1f8d77c..752b404 100644
> --- a/src/slurmctld/power_save.c
> +++ b/src/slurmctld/power_save.c
> @@ -368,7 +368,7 @@ static void _do_power_work(time_t now)
>                 /* Suspend nodes as appropriate */
>                 if ((susp_state == 0)                                   &&
>                     ((suspend_rate == 0) || (suspend_cnt < suspend_rate)) &&
> -                   (IS_NODE_IDLE(node_ptr) || IS_NODE_DOWN(node_ptr))  &&
> +                   (IS_NODE_IDLE(node_ptr))                            &&
>                     (node_ptr->sus_job_cnt == 0)                        &&
>                     (!IS_NODE_COMPLETING(node_ptr))                     &&
>                     (!IS_NODE_POWER_UP(node_ptr))                       &&
> 
> 
> Best regards,
> Jacek Budzowski
> 
> On Mon, 24.08.2020 at 08:52 +0000, Steininger, Herbert wrote:
>> Hi,
>> 
>> how can I prevent Slurm from suspending nodes that I have set to the down state for maintenance?
>> I know about "SuspendExcNodes", but rolling out slurm.conf every time this changes doesn't seem like the right way.
>> Is there a state that I can set so that the nodes don't get suspended?
>> 
>> It has happened a few times that I was working on a server and, after our idle time (1h), Slurm decided to suspend the node.
>> 
>> TIA,
>> Herbert
>> 
> -- 
> Jacek Budzowski
> System administrator
> ACC Cyfronet AGH



