[slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set
Tim Schneider
tim.schneider1 at tu-darmstadt.de
Wed Oct 25 13:47:11 UTC 2023
Hi Ole,
thanks for your reply.
The curious thing is that when I run "scontrol reboot nextstate=RESUME
<node>", the node's DRAIN flag is not set (sinfo shows mix@ and
"scontrol show node <node>" shows no DRAIN in State, just
MIXED+REBOOT_REQUESTED), yet no jobs are scheduled on that node until it
reboots. If I specifically request that node for a job with "-w <node>",
I get "Nodes required for job are DOWN, DRAINED or reserved for jobs in
higher priority partitions".
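For reference, this is roughly how I check the node state (the node name
and the sinfo format string are just what I happen to use):

    sinfo -n <node> -o "%N %T"    # state shows with an "@" suffix (reboot requested), no drain
    scontrol show node <node>     # State=MIXED+REBOOT_REQUESTED, no DRAIN flag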
Not using nextstate=RESUME is inconvenient for me: sometimes parts of
our cluster are drained, and I would like to run a single command that
reboots all non-drained nodes once they become idle and all drained
nodes immediately, resuming them once they have finished reinstalling.
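Concretely, what I was hoping to be able to use is something like the
following (the reason text is just an example):

    scontrol reboot nextstate=RESUME reason="reinstall" ALL

Drained nodes have no running jobs, so they would reboot right away, busy
nodes would reboot once they become idle, and nextstate=RESUME would bring
everything back undrained after the reinstallation.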
Best,
Tim
On 25.10.23 14:59, Ole Holm Nielsen wrote:
> Hi Tim,
>
> I think the scontrol manual page explains the "scontrol reboot" function
> fairly well:
>
>> reboot [ASAP] [nextstate={RESUME|DOWN}] [reason=<reason>]
>> {ALL|<NodeList>}
>> Reboot the nodes in the system when they become idle using the
>> RebootProgram as configured in Slurm's slurm.conf file. Each
>> node will have the "REBOOT" flag added to its node state. After
>> a node reboots and the slurmd daemon starts up again, the
>> HealthCheckProgram will run once. Then, the slurmd daemon will
>> register itself with the slurmctld daemon and the "REBOOT" flag
>> will be cleared. The node's "DRAIN" state flag will be cleared
>> if the reboot was "ASAP", nextstate=resume or down. The "ASAP"
>> option adds the "DRAIN" flag to each node's state, preventing
>> additional jobs from running on the node so it can be rebooted
>> and returned to service "As Soon As Possible" (i.e. ASAP).
> It seems to be implicitly understood that if nextstate is specified, this
> implies setting the "DRAIN" state flag:
>
>> The node's "DRAIN" state flag will be cleared if the reboot was "ASAP", nextstate=resume or down.
> You can verify the node's DRAIN flag with "scontrol show node <nodename>".
>
> IMHO, if you want nodes to continue accepting new jobs, then nextstate is
> irrelevant.
>
> We always use "reboot ASAP" because our cluster is usually so busy that
> nodes never become idle if left to themselves :-)
>
> FYI: We regularly make package updates and firmware updates using the
> "scontrol reboot asap" method which is explained in this script:
> https://github.com/OleHolmNielsen/Slurm_tools/blob/master/nodes/update.sh
>
> Best regards,
> Ole
>
>
> On 10/25/23 13:39, Tim Schneider wrote:
>> Hi Chris,
>>
>> thanks a lot for your response.
>>
>> I just realized that I made a mistake in my post. In the section you cite,
>> the command is supposed to be "scontrol reboot nextstate=RESUME" (without
>> ASAP).
>>
>> So to clarify: my problem is that if I type "scontrol reboot
>> nextstate=RESUME", no new jobs get scheduled anymore until the reboot. On
>> the other hand, if I type "scontrol reboot", jobs continue to get
>> scheduled, which is what I want. I just don't understand why setting
>> nextstate results in the nodes not accepting jobs anymore.
>>
>> My use case is similar to the one you describe. We use the ASAP option when
>> we install a new image to ensure that from the point of the reinstallation
>> onwards, all jobs end up on nodes with the new configuration only.
>> However, in some cases when we make only minor changes to the image
>> configuration, we prefer to cause as little disruption as possible and
>> just reinstall the nodes whenever they are idle. Here, being able to set
>> nextstate=RESUME is useful, since we usually want the nodes to resume
>> after reinstallation, no matter what their previous state was.
>>
>> Hope that clears it up and sorry for the confusion!
>>
>> Best,
>>
>> tim
>>
>> On 25.10.23 02:10, Christopher Samuel wrote:
>>> On 10/24/23 12:39, Tim Schneider wrote:
>>>
>>>> Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME
>>>> <node>", the node goes in "mix@" state (not drain), but no new jobs get
>>>> scheduled until the node reboots. Essentially I get draining behavior,
>>>> even though the node's state is not "drain". Note that this behavior is
>>>> caused by "nextstate=RESUME"; if I leave that out, jobs get scheduled
>>>> as expected. Does anyone have an idea why that could be?
>>> The intent of the "ASAP" flag for "scontrol reboot" is to not let any
>>> more jobs onto a node until it has rebooted.
>>>
>>> IIRC that was from work we sponsored, the idea being that (for how our
>>> nodes are managed) we would build new images with the latest software
>>> stack, test them on a separate test system and then once happy bring
>>> them over to the production system and do an "scontrol reboot ASAP
>>> nextstate=resume reason=... $NODES" to ensure that from that point
>>> onwards no new jobs would start in the old software configuration, only
>>> the new one.
>>>
>>> Also slurmctld would know that these nodes are due to come back in
>>> "ResumeTimeout" seconds after the reboot is issued and so could plan for
>>> them as part of scheduling large jobs, rather than thinking there was no
>>> way it could do so and letting lots of smaller jobs get in the way.