[slurm-users] srun --reboot option is not working

Brian Andrus toomuchit at gmail.com
Mon Mar 9 15:56:54 UTC 2020


You may want to double check that the node is actually rebooting and 
that slurmd is set to start on boot.

"ResumeTimeout reached", in a nutshell, means slurmd on the node did not 
check in with slurmctld before the timeout expired.
Are you able to log onto the node itself and see that it has rebooted?
If so, run something like 'sinfo' from the node to verify that it can 
talk to slurmctld, and check that slurmd started successfully.
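
The checks above could be scripted roughly as follows. This is only a 
sketch, not a tested procedure: it assumes slurmd is managed by systemd 
(the thread does not say so), and the helper names run, parse_states, 
looks_stuck and report are made up for illustration. It is meant to be 
run on the compute node itself once it has come back up.

```python
import subprocess

# Commands to run on the compute node after the reboot. The systemctl lines
# assume slurmd is managed by systemd, which the thread does not confirm.
CHECKS = [
    ["systemctl", "is-enabled", "slurmd"],  # will slurmd start on boot?
    ["systemctl", "is-active", "slurmd"],   # is slurmd running right now?
    ["scontrol", "ping"],                   # can this node reach slurmctld?
    ["sinfo", "-h", "-N", "-o", "%N %T"],   # node states as slurmctld sees them
]


def run(cmd):
    """Run one command; return (exit_code, output), or (None, reason) on failure to run."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return None, str(exc)
    return proc.returncode, (proc.stdout + proc.stderr).strip()


def parse_states(sinfo_output):
    """Map node name -> state from 'sinfo -h -N -o "%N %T"' output."""
    states = {}
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            states[parts[0]] = parts[1]
    return states


def looks_stuck(state):
    """True if the state is down, not responding ('*'), or still powering up ('#')."""
    return state.startswith("down") or state.endswith(("*", "#"))


def report():
    """Run every check and print its exit code and output."""
    for cmd in CHECKS:
        code, out = run(cmd)
        print(f"$ {' '.join(cmd)}\n[exit {code}] {out}\n")
```

Calling report() prints the result of each check. If the sinfo call 
itself hangs or errors out, the node cannot reach slurmctld at all; if it 
works, passing its output through parse_states and looks_stuck shows 
whether the controller still sees the node as down, not responding, or 
powering up.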

Brian Andrus

On 3/9/2020 4:38 AM, MrBr @ GMail wrote:
> Hi all
>
> I'm trying to use the --reboot option of srun to reboot the nodes 
> before allocation.
> However, the nodes are not being rebooted.
>
> The node gets stuck in the allocated# state as shown by sinfo, or CF as 
> shown by squeue.
> The logs of slurmctld and slurmd show no relevant information, even 
> with debug levels at "debug5".
> Eventually the nodes go "down" due to "ResumeTimeout reached".
>
> The strangest thing is that "scontrol reboot <nodename>" works without 
> any issues.
> AFAIK both commands rely on the same RebootProgram.
>
> The srun documentation contains the following statement: "This is only 
> supported with some system configurations and will otherwise be 
> silently ignored". Maybe I have this "non-supported" configuration?
>
> Does anyone have a suggestion regarding the root cause of this behavior 
> or a possible investigation path?
>
> Tech data:
> Slurm 19.05
> The user that executes the srun is an admin, although it's not 
> required in 19.05


