[slurm-users] srun --reboot option is not working

MrBr @ GMail mrbr.mail at gmail.com
Mon Mar 9 19:21:35 UTC 2020


> Ah. Looks like the --reboot option is telling slurmctld to put them in
> the CF state and wait for them to come back up. Slurmctld then waits for
> them to 'disconnect' and come back. Since they never reboot (therefore
> never disconnect), slurmctld keeps them in the CF state until the timeout
> occurs.
Hmm, that seems logical. Is there a way for me to confirm this? The
slurmctld log says nothing.
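
One way to check this from the outside, as a minimal sketch: poll the node state while the "--reboot" job is pending and see whether it ever leaves the configuring state. The helper name node_state is hypothetical; it only assumes that "scontrol show node" prints a State=... token, and the exact state string may differ between Slurm versions.

```shell
# Hypothetical helper: reads `scontrol show node <nodename>` output on stdin
# and prints the value of the State= field.
node_state() {
    # Put every whitespace-separated token on its own line, then keep
    # only the token that starts with "State=".
    tr -s '[:space:]' '\n' | sed -n 's/^State=//p'
}

# Intended use (not run here):
#   scontrol show node <nodename> | node_state
```

If the printed state never changes between the srun call and the ResumeTimeout, that would be consistent with slurmctld waiting for a reboot that never happens.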

> Do you have RebootProgram defined?
Yes, and it successfully works with "scontrol reboot <servername>"
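
For reference, a small sketch of how the value that slurmctld actually has loaded could be read back, rather than relying on the slurm.conf file alone. The helper name reboot_program is hypothetical, and it assumes the usual "Name = value" layout of "scontrol show config" output.

```shell
# Hypothetical helper: reads `scontrol show config` output on stdin and
# prints whatever follows "RebootProgram =" on that line.
reboot_program() {
    sed -n 's/^RebootProgram[[:space:]]*=[[:space:]]*//p'
}

# Intended use (not run here):
#   scontrol show config | reboot_program
```

An empty or null-looking value here would mean the running controller has no reboot program configured, whatever the file on disk says.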

> so normal users cannot use "--reboot"

1. As far as I understand, this is true only since version 20.02. I have 19.05.

2. If I'm using the wrong user, should that be reflected in some log?

3. I think that I've configured my user as an admin, but I'm not 100% sure.
Please see the output below.

$ sacctmgr list user michael
      User   Def Acct     Admin
---------- ---------- ---------
 michael   sw_user Administ+
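
The Admin column above is truncated to "Administ+", so the full value is not visible. A sketch of one way to read it untruncated, assuming the -n (no header) and -P (pipe-separated) options of sacctmgr and the AdminLevel format field; the helper name admin_level is hypothetical.

```shell
# Hypothetical helper: reads pipe-separated `sacctmgr` output on stdin
# (one "user|adminlevel" pair per line) and prints the admin level of
# the user given as the first argument.
admin_level() {
    awk -F'|' -v u="$1" '$1 == u { print $2 }'
}

# Intended use (not run here):
#   sacctmgr -n -P show user michael format=User,AdminLevel | admin_level michael
```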


On Mon, Mar 9, 2020 at 8:43 PM Brian Andrus <toomuchit at gmail.com> wrote:

> Ah. Looks like the --reboot option is telling slurmctld to put them in the
> CF state and wait for them to come back up. Slurmctld then waits for them
> to 'disconnect' and come back. Since they never reboot (therefore never
> disconnect), slurmctld keeps them in the CF state until the timeout occurs.
>
> Do you have RebootProgram defined?
>
> Note, the manual states:
>
>               Force the allocated nodes to reboot before starting the
> job.  This is only supported with some system configurations and will
> otherwise be silently ignored. *Only root, SlurmUser or admins can reboot
> nodes.*
>
> so normal users cannot use "--reboot"
>
> Brian Andrus
> On 3/9/2020 10:14 AM, MrBr @ GMail wrote:
>
> Hi Brian
> The nodes work with slurm without any issues till I try the "--reboot"
> option.
> I can successfully allocate the nodes and perform any other Slurm-related
> operation.
>
> > You may want to double check that the node is actually rebooting and
> > that slurmd is set to start on boot.
> That's the problem: they are not being rebooted. I'm monitoring the nodes.
>
> sinfo from the nodes works without issue before and after using "--reboot"
> slurmd is up
>
>
> On Mon, Mar 9, 2020 at 5:59 PM Brian Andrus <toomuchit at gmail.com> wrote:
>
>> You may want to double check that the node is actually rebooting and
>> that slurmd is set to start on boot.
>>
>> ResumeTimeoutReached, in a nutshell, means slurmd isn't talking to
>> slurmctld.
>> Are you able to log onto the node itself and see that it has rebooted?
>> If so, try doing something like 'sinfo' from the node and verify it is
>> able to talk to slurmctld from the node and verify slurmd started
>> successfully.
>>
>> Brian Andrus
>>
>> On 3/9/2020 4:38 AM, MrBr @ GMail wrote:
>> > Hi all
>> >
>> > I'm trying to use the --reboot option of srun to reboot the nodes
>> > before allocation.
>> > However, the nodes are not being rebooted.
>> >
>> > The node gets stuck in the allocated# state as shown by sinfo, or CF as
>> > shown by squeue.
>> > The logs of slurmctld and slurmd show no relevant information, with
>> > debug levels at "debug5".
>> > Eventually the nodes go "down" due to "ResumeTimeout reached".
>> >
>> > The strangest thing is that "scontrol reboot <nodename>" works without
>> > any issues.
>> > AFAIK both commands rely on the same RebootProgram.
>> >
>> > In the srun documentation there is the following statement: "This is
>> > only supported with some system configurations and will otherwise be
>> > silently ignored". Maybe I have this "non-supported" configuration?
>> >
>> > Does anyone have a suggestion regarding the root cause of this behavior
>> > or a possible investigation path?
>> >
>> > Tech data:
>> > Slurm 19.05
>> > The user that executes the srun is an admin, although it's not
>> > required in 19.05
>>
>>