[slurm-users] srun --reboot option is not working
Brian Andrus
toomuchit at gmail.com
Tue Mar 10 16:08:42 UTC 2020
I built and ran a quick test on an older Slurm release and do see the issue.
It looks like a possible bug. I would open a bug with SchedMD.
I couldn't think of a good work-around, since the job would get
rescheduled to a different node if you reboot, even if you have the node
update its own status at boot. It could probably be worked around, but
not in a simple way. Easier to upgrade to the newest release :)
Brian Andrus
On 3/9/2020 10:14 AM, MrBr @ GMail wrote:
> Hi Brian
> The nodes work with Slurm without any issues until I try the "--reboot"
> option.
> I can successfully allocate the nodes and run any other Slurm-related operation.
>
> > You may want to double check that the node is actually rebooting and
> > that slurmd is set to start on boot.
> That's the problem: they are not being rebooted. I'm monitoring the nodes.
>
> sinfo from the nodes works without issue before and after using "--reboot"
> slurmd is up
>
>
> On Mon, Mar 9, 2020 at 5:59 PM Brian Andrus <toomuchit at gmail.com> wrote:
>
> You may want to double check that the node is actually rebooting and
> that slurmd is set to start on boot.
>
> ResumeTimeoutReached, in a nutshell, means slurmd isn't talking to
> slurmctld.
> Are you able to log onto the node itself and see that it has rebooted?
> If so, try running something like 'sinfo' on the node to verify that it
> can talk to slurmctld, and verify that slurmd started successfully.
>
> Brian Andrus
>
> On 3/9/2020 4:38 AM, MrBr @ GMail wrote:
> > Hi all
> >
> > I'm trying to use the --reboot option of srun to reboot the nodes
> > before allocation.
> > However, the nodes are not being rebooted.
> >
> > The node gets stuck in the allocated# state as shown by sinfo, or CF as
> > shown by squeue.
> > The slurmctld and slurmd logs show no relevant information, with
> > debug levels set to "debug5".
> > Eventually the nodes go to "down" due to "ResumeTimeout reached".
> >
> > The strangest thing is that "scontrol reboot <nodename>" works without
> > any issues.
> > AFAIK both commands rely on the same RebootProgram.
> >
> > The srun documentation contains the following statement: "This is only
> > supported with some system configurations and will otherwise be
> > silently ignored". Maybe I have this "non-supported" configuration?
> >
> > Does anyone have a suggestion regarding the root cause of this behavior or
> > a possible investigation path?
> >
> > Tech data:
> > Slurm 19.05
> > The user that executes the srun is an admin, although it's not
> > required in 19.05
>
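For anyone following the investigation path asked about above, here is a
rough Python sketch (not something from Slurm itself) for reading the states
mentioned in the thread and for checking that RebootProgram and ResumeTimeout
are set. The helper names split_node_state and parse_config are hypothetical.
The suffix table is only a subset of what the sinfo man page lists, as I
understand it, and the sample text in the usage note is an illustrative
stand-in for "scontrol show config" output, not output from a real cluster.

```python
# Sketch only: the helper names below are hypothetical, not part of Slurm.

# Suffix characters sinfo may append to a node state (subset, per the
# sinfo man page as I understand it).
NODE_STATE_SUFFIXES = {
    "*": "node is not responding",
    "~": "node is in power saving mode",
    "#": "node is being powered up or configured",
    "$": "node is in a maintenance reservation",
    "@": "node is pending reboot",
}

# Subset of the compact job state codes printed by squeue.
JOB_STATE_CODES = {"CF": "CONFIGURING", "PD": "PENDING", "R": "RUNNING"}


def split_node_state(state):
    """Split a state such as 'allocated#' into its base and suffix meanings."""
    base = state.rstrip("".join(NODE_STATE_SUFFIXES))
    flags = [NODE_STATE_SUFFIXES[ch] for ch in state[len(base):]]
    return base, flags


def parse_config(text):
    """Collect 'Key = value' lines, the layout 'scontrol show config' prints."""
    config = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without '=' such as the header line
            config[key.strip()] = value.strip()
    return config
```

Usage would be along the lines of split_node_state("allocated#"), which
separates the base state from the "#" flag, and
parse_config(text).get("RebootProgram"), where text is whatever
"scontrol show config" printed on the controller. Keeping the parsing in
plain functions means they can be tried on pasted output without access to
the cluster.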