[slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set

Tim Schneider tim.schneider1 at tu-darmstadt.de
Tue Oct 24 19:39:46 UTC 2023


Hi,

From my understanding, if I run "scontrol reboot <node>", the node 
should continue to operate as usual and reboot once it is idle. When 
adding the ASAP flag ("scontrol reboot ASAP <node>"), the node should go 
into drain state and not accept any more jobs.
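For reference, these are the two variants as I understand them (<node> 
is a placeholder for the actual node name); the sinfo call is just a 
quick way to inspect the resulting node state:

# Reboot once the node becomes idle; jobs keep getting scheduled meanwhile
scontrol reboot <node>

# Reboot as soon as possible; drains the node so no new jobs start
scontrol reboot ASAP <node>

# Check the node state; a trailing "@" indicates a pending reboot
sinfo -n <node> -o "%N %T"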

Now my issue is that when I run "scontrol reboot nextstate=RESUME 
<node>", the node goes into the "mix@" state (not drain), but no new 
jobs get scheduled until the node reboots. Essentially, I get draining 
behavior even though the node's state is not "drain". Note that this 
behavior is caused by "nextstate=RESUME"; if I leave it out, jobs get 
scheduled as expected. Does anyone have an idea why that could be?
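This is roughly how I inspect the situation (again with <node> as a 
placeholder); the squeue format string just prints each job's ID, 
state, and the scheduler's reason for not starting it:

# Show the node's state line, including any reboot-related flags
scontrol show node <node> | grep -i "State"

# List jobs targeting the node together with their pending reason
squeue -w <node> -o "%i %T %R"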

I am running slurm 22.05.9.

Steps to reproduce:

# To prevent node from rebooting immediately
sbatch -t 1:00:00 -c 1 --mem-per-cpu 1G -w <node> ./long_running_script.sh

# Request reboot
scontrol reboot nextstate=RESUME <node>

# Run an interactive command; it does not start until
# "scontrol cancel_reboot <node>" is executed in another shell
srun -t 1:00:00 -c 1 --mem-per-cpu 1G -w <node> --pty bash
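
# As noted in the comment above, cancelling the pending reboot
# unblocks the interactive job immediately
scontrol cancel_reboot <node>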


Thanks a lot in advance!

Best,

Tim



