[slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set
Tim Schneider
tim.schneider1 at tu-darmstadt.de
Wed Oct 25 13:47:11 UTC 2023
Hi Ole,
thanks for your reply.
The curious thing is that when I run "scontrol reboot nextstate=RESUME
<node>", the node's DRAIN flag is not set (sinfo shows mix@ and
"scontrol show node <node>" shows no DRAIN in State, just
MIXED+REBOOT_REQUESTED), yet no jobs are scheduled on that node until it
reboots. If I specifically request that node for a job with "-w <node>",
I get "Nodes required for job are DOWN, DRAINED or reserved for jobs in
higher priority partitions".
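For reference, this is roughly how I check the node state (the node name
and the sinfo format string are just what I happen to use):

    sinfo -n <node> -o "%N %T"    # state shows with an "@" suffix (reboot requested), no drain
    scontrol show node <node>     # State=MIXED+REBOOT_REQUESTED, no DRAIN flag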
Not using nextstate=RESUME is inconvenient for me: sometimes parts of
our cluster are drained, and I would like to run a single command that
reboots all non-drained nodes once they become idle and all drained
nodes immediately, resuming them once they have finished reinstalling.
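Concretely, what I was hoping to be able to use is something like the
following (the reason text is just an example):

    scontrol reboot nextstate=RESUME reason="reinstall" ALL

Drained nodes have no running jobs, so they would reboot right away, busy
nodes would reboot once they become idle, and nextstate=RESUME would bring
everything back undrained after the reinstallation.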
Best,
Tim
On 25.10.23 14:59, Ole Holm Nielsen wrote:
> Hi Tim,
>
> I think the scontrol manual page explains the "scontrol reboot" function
> fairly well:
>
>> reboot [ASAP] [nextstate={RESUME|DOWN}] [reason=<reason>]
>> {ALL|<NodeList>}
>> Reboot the nodes in the system when they become idle using the
>> RebootProgram as configured in Slurm's slurm.conf file. Each
>> node will have the "REBOOT" flag added to its node state. After
>> a node reboots and the slurmd daemon starts up again, the
>> HealthCheckProgram will run once. Then, the slurmd daemon will
>> register itself with the slurmctld daemon and the "REBOOT" flag
>> will be cleared. The node's "DRAIN" state flag will be cleared
>> if the reboot was "ASAP", nextstate=resume or down. The "ASAP"
>> option adds the "DRAIN" flag to each node's state, preventing
>> additional jobs from running on the node so it can be rebooted
>> and returned to service "As Soon As Possible" (i.e. ASAP).
> It seems to be implicitly understood that if nextstate is specified, this
> implies setting the "DRAIN" state flag:
>
>> The node's "DRAIN" state flag will be cleared if the reboot was "ASAP", nextstate=resume or down.
> You can verify the node's DRAIN flag with "scontrol show node <nodename>".
>
> IMHO, if you want nodes to continue accepting new jobs, then nextstate is
> irrelevant.
>
> We always use "reboot ASAP" because our cluster is usually so busy that
> nodes never become idle if left to themselves :-)
>
> FYI: We regularly make package updates and firmware updates using the
> "scontrol reboot asap" method which is explained in this script:
> https://github.com/OleHolmNielsen/Slurm_tools/blob/master/nodes/update.sh
>
> Best regards,
> Ole
>
>
> On 10/25/23 13:39, Tim Schneider wrote:
>> Hi Chris,
>>
>> thanks a lot for your response.
>>
>> I just realized that I made a mistake in my post. In the section you cite,
>> the command is supposed to be "scontrol reboot nextstate=RESUME" (without
>> ASAP).
>>
>> So to clarify: my problem is that if I type "scontrol reboot
>> nextstate=RESUME", no new jobs get scheduled anymore until the reboot. On
>> the other hand, if I type "scontrol reboot", jobs continue to get
>> scheduled, which is what I want. I just don't understand why setting
>> nextstate results in the nodes not accepting jobs anymore.
>>
>> My use case is similar to the one you describe. We use the ASAP option when
>> we install a new image to ensure that from the point of the reinstallation
>> onwards, all jobs end up on nodes with the new configuration only.
>> However, in some cases when we make only minor changes to the image
>> configuration, we prefer to cause as little disruption as possible and
>> just reinstall the nodes whenever they are idle. Here, being able to set
>> nextstate=RESUME is useful, since we usually want the nodes to resume
>> after reinstallation, no matter what their previous state was.
>>
>> Hope that clears it up and sorry for the confusion!
>>
>> Best,
>>
>> tim
>>
>> On 25.10.23 02:10, Christopher Samuel wrote:
>>> On 10/24/23 12:39, Tim Schneider wrote:
>>>
>>>> Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME
>>>> <node>", the node goes in "mix@" state (not drain), but no new jobs get
>>>> scheduled until the node reboots. Essentially I get draining behavior,
>>>> even though the node's state is not "drain". Note that this behavior is
>>>> caused by "nextstate=RESUME"; if I leave that out, jobs get scheduled
>>>> as expected. Does anyone have an idea why that could be?
>>> The intent of the "ASAP" flag for "scontrol reboot" is to not let any
>>> more jobs onto a node until it has rebooted.
>>>
>>> IIRC that was from work we sponsored, the idea being that (for how our
>>> nodes are managed) we would build new images with the latest software
>>> stack, test them on a separate test system and then once happy bring
>>> them over to the production system and do an "scontrol reboot ASAP
>>> nextstate=resume reason=... $NODES" to ensure that from that point
>>> onwards no new jobs would start in the old software configuration, only
>>> the new one.
>>>
>>> Also slurmctld would know that these nodes are due to come back in
>>> "ResumeTimeout" seconds after the reboot is issued and so could plan for
>>> them as part of scheduling large jobs, rather than thinking there was no
>>> way it could do so and letting lots of smaller jobs get in the way.