[slurm-users] srun --reboot option is not working

Brian Andrus toomuchit at gmail.com
Tue Mar 10 16:08:42 UTC 2020


I built and ran a quick test on an older Slurm release and do see the issue.
It looks like a possible bug. I would open a bug with SchedMD.

I couldn't think of a good work-around, since the job would get 
rescheduled to a different node if you reboot, even if you have the node 
update its own status at boot. It could probably be worked around, but 
not in a simple way. Easier to upgrade to the newest release :)
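
If it helps when gathering details for the bug report, here is a minimal Python sketch for spotting nodes that sit in a "#" state (as in the allocated# state reported below). It assumes the input text comes from something like `sinfo -N -h -o "%N %T"`, i.e. one "name state" pair per line. The function name and the sample node names are made up for illustration, and it has not been run against a real cluster.

```python
from typing import List


def nodes_stuck_powering_up(sinfo_output: str) -> List[str]:
    """Return the names of nodes whose state ends with '#'.

    Assumes each line holds exactly two whitespace-separated fields:
    the node name and its state (for example "node01 allocated#").
    Lines that do not match that shape are skipped.
    """
    stuck = []
    for line in sinfo_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # blank or unexpected line, ignore it
        name, state = fields
        if state.endswith("#"):
            stuck.append(name)
    return stuck


# Hypothetical sample text, not captured from a real system.
sample = "node01 allocated#\nnode02 idle\nnode03 down*\n"
print(nodes_stuck_powering_up(sample))
```

Running this periodically while a job waits on "--reboot" would at least show how long a node stays in that state before ResumeTimeout is reached.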

Brian Andrus

On 3/9/2020 10:14 AM, MrBr @ GMail wrote:
> Hi Brian
> The nodes work with slurm without any issues till I try the "--reboot" 
> option.
> I can successfully allocate the nodes and perform any other Slurm-related
> operation.
>
> > You may want to double check that the node is actually rebooting and
> > that slurmd is set to start on boot.
> That's the problem: they are not being rebooted. I'm monitoring the nodes.
>
> sinfo from the nodes works without issue before and after using "--reboot",
> and slurmd is up.
>
>
> On Mon, Mar 9, 2020 at 5:59 PM Brian Andrus <toomuchit at gmail.com> wrote:
>
>     You may want to double check that the node is actually rebooting and
>     that slurmd is set to start on boot.
>
>     ResumeTimeoutReached, in a nutshell, means slurmd isn't talking to
>     slurmctld.
>     Are you able to log onto the node itself and see that it has rebooted?
>     If so, try running something like 'sinfo' from the node to verify that
>     it is able to talk to slurmctld, and check that slurmd started
>     successfully.
>
>     Brian Andrus
>
>     On 3/9/2020 4:38 AM, MrBr @ GMail wrote:
>     > Hi all
>     >
>     > I'm trying to use the --reboot option of srun to reboot the nodes
>     > before allocation.
>     > However, the nodes are not being rebooted.
>     >
>     > The node gets stuck in the allocated# state as shown by sinfo, or CF
>     > as shown by squeue.
>     > The logs of slurmctld and slurmd show no relevant information, even
>     > with debug levels at "debug5".
>     > Eventually the nodes go "down" due to "ResumeTimeout reached".
>     >
>     > The strangest thing is that "scontrol reboot <nodename>" works
>     > without any issues.
>     > AFAIK both commands rely on the same RebootProgram.
>     >
>     > The srun documentation contains the following statement: "This is only
>     > supported with some system configurations and will otherwise be
>     > silently ignored". Maybe I have this "non-supported" configuration?
>     >
>     > Does anyone have a suggestion regarding the root cause of this
>     > behavior or a possible investigation path?
>     >
>     > Tech data:
>     > Slurm 19.05
>     > The user that executes the srun is an admin, although it's not
>     > required in 19.05
>