<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>I built and ran a quick test on an older Slurm release and do see the issue.
It looks like a possible bug. I would open a bug report with SchedMD.</p>
<p>I couldn't think of a good workaround, since the job would get
rescheduled to a different node if you reboot, even if you have
the node update its own status at boot. It could probably be
worked around, but not in a simple way. Easier to upgrade to the
newest release :)</p>
<p>Brian Andrus<br>
</p>
<div class="moz-cite-prefix">On 3/9/2020 10:14 AM, MrBr @ GMail
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CADAEx60B0-=vgKXiBwcL=HANT11L82-ghvS9=Dud1okPFUQTvQ@mail.gmail.com">
<div dir="ltr">Hi Brian
<div>The nodes work with Slurm without any issues until I try the
"--reboot" option.</div>
<div>I can successfully allocate the nodes and run any other
Slurm-related operation.</div>
<div><br>
</div>
<div>> You may want to double check that the node is actually
rebooting and that slurmd is set to start on boot.</div>
<div>That's the problem: they are not being rebooted. I'm
monitoring the nodes.<br>
</div>
<div><br>
</div>
<div>sinfo from the nodes works without issue before and after
using "--reboot"</div>
<div>slurmd is up<br>
<div>
<div><br>
</div>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, Mar 9, 2020 at 5:59 PM
Brian Andrus <<a href="mailto:toomuchit@gmail.com"
moz-do-not-send="true">toomuchit@gmail.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">You
may want to double check that the node is actually rebooting
and <br>
that slurmd is set to start on boot.<br>
<br>
ResumeTimeoutReached, in a nutshell, means slurmd isn't
talking to <br>
slurmctld.<br>
Are you able to log onto the node itself and see that it has
rebooted?<br>
If so, try doing something like 'sinfo' from the node and
verify it is <br>
able to talk to slurmctld from the node and verify slurmd
started <br>
successfully.<br>
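<p>As a rough aid for the checks above, here is a minimal Python sketch
(not part of Slurm; the function names are made up for illustration) that
reads the output of <code>sinfo -N -h -o "%N %T"</code> and flags nodes
whose state ends in "#" (being powered up or configured) or starts with
"down". It assumes each output line has exactly a node name and a state.</p>
<pre>
```python
import subprocess


def run_sinfo():
    """Return one "NODE STATE" line per node, as printed by sinfo."""
    # -N: one line per node, -h: no header,
    # %N: node name, %T: node state in extended form
    result = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %T"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def suspicious_nodes(sinfo_text):
    """Map node name to state for nodes that look stuck or down."""
    flagged = {}
    for line in sinfo_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, state = parts
        # A trailing "#" marks a node being powered up or configured;
        # "down" is where nodes end up once ResumeTimeout is reached.
        if state.endswith("#") or state.startswith("down"):
            flagged[name] = state
    return flagged
```
</pre>
<p>On a host with the Slurm client tools installed,
<code>suspicious_nodes(run_sinfo())</code> would list the nodes worth
logging onto first. It only reports what sinfo shows; it does not
confirm whether a node actually rebooted.</p>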
<br>
Brian Andrus<br>
<br>
On 3/9/2020 4:38 AM, MrBr @ GMail wrote:<br>
> Hi all<br>
><br>
> I'm trying to use the --reboot option of srun to reboot
the nodes <br>
> before allocation.<br>
> However, the nodes are not being rebooted.<br>
><br>
> The node gets stuck in the allocated# state as shown by sinfo,
or CF as <br>
> shown by squeue.<br>
> The logs of slurmctld and slurmd show no relevant
information, <br>
> with debug levels at "debug5".<br>
> Eventually the nodes go "down" due to "ResumeTimeout
reached".<br>
><br>
> The strangest thing is that "scontrol reboot
&lt;nodename&gt;" works without <br>
> any issues.<br>
> AFAIK both commands rely on the same RebootProgram.<br>
><br>
> The srun documentation contains the following statement: "This is
only <br>
> supported with some system configurations and will
otherwise be <br>
> silently ignored". Maybe I have such a "non-supported"
configuration?<br>
><br>
> Does anyone have a suggestion regarding the root cause of this
behavior or <br>
> a possible investigation path?<br>
><br>
> Tech data:<br>
> Slurm 19.05<br>
> The user that executes the srun is an admin, although
it's not <br>
> required in 19.05<br>
<br>
</blockquote>
</div>
</blockquote>
</body>
</html>