[slurm-users] Unable to start slurmd service

Stephen Cousins steve.cousins at maine.edu
Tue Nov 16 18:44:56 UTC 2021


scontrol update nodename=name-of-node state=resume
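Note that state=resume here is a *node* state: it clears the DOWN flag the controller set when the node unexpectedly rebooted. It is unrelated to the job-level "scontrol resume <jobid>" mentioned in the docs. A minimal sketch, using the node name from your sinfo output (these need a live cluster, so adjust as required):

```shell
# Check why the controller marked the node down (the Reason= field):
scontrol show node ecpsc10 | grep -i reason

# Clear the DOWN state so the node can accept jobs again:
scontrol update nodename=ecpsc10 state=resume

# Confirm it is back:
sinfo --nodes=ecpsc10
```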



On Tue, Nov 16, 2021, 1:36 PM Jaep Emmanuel <emmanuel.jaep at epfl.ch> wrote:

> How do you do that?
>
> As per the documentation, the resume command applies to jobs (
> https://slurm.schedmd.com/scontrol.html), not to nodes.
>
>
>
> *From: *slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Stephen Cousins <steve.cousins at maine.edu>
> *Reply to: *Slurm User Community List <slurm-users at lists.schedmd.com>
> *Date: *Tuesday, 16 November 2021 at 19:09
> *To: *Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject: *Re: [slurm-users] Unable to start slurmd service
>
>
>
> I think you just need to use scontrol to "resume" that node.
>
>
>
> On Tue, Nov 16, 2021, 10:10 AM Jaep Emmanuel <emmanuel.jaep at epfl.ch>
> wrote:
>
> Hi,
>
>
>
> This might be a newbie question since I'm new to slurm.
>
> I'm trying to restart the slurmd service on one of our Ubuntu boxes.
>
>
>
> The slurmd.service is defined by:
>
>
>
> [Unit]
>
> Description=Slurm node daemon
>
> After=network.target munge.service
>
> ConditionPathExists=/etc/slurm/slurm.conf
>
>
>
> [Service]
>
> Type=forking
>
> EnvironmentFile=-/etc/sysconfig/slurmd
>
> ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS
>
> ExecReload=/bin/kill -HUP $MAINPID
>
> PIDFile=/var/run/slurmd.pid
>
> KillMode=process
>
> LimitNOFILE=51200
>
> LimitMEMLOCK=infinity
>
> LimitSTACK=infinity
>
>
>
> [Install]
>
> WantedBy=multi-user.target
>
>
>
>
>
> The service starts without issue (systemctl start slurmd.service).
>
> However, when checking the status of the service, I get a couple of error
> messages, but nothing alarming:
>
>
>
> ~# systemctl status slurmd.service
>
> ● slurmd.service - Slurm node daemon
>
>      Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor
> preset: enabled)
>
>      Active: active (running) since Tue 2021-11-16 15:58:01 CET; 50s ago
>
>     Process: 2713019 ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd
> $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
>
>    Main PID: 2713021 (slurmd)
>
>       Tasks: 1 (limit: 134845)
>
>      Memory: 1.9M
>
>      CGroup: /system.slice/slurmd.service
>
>              └─2713021 /usr/sbin/slurmd -d /usr/sbin/slurmstepd
>
>
>
> Nov 16 15:58:01 ecpsc10 systemd[1]: Starting Slurm node daemon...
>
> Nov 16 15:58:01 ecpsc10 systemd[1]: slurmd.service: Can't open PID file
> /run/slurmd.pid (yet?) after start: Operation not pe>
>
> Nov 16 15:58:01 ecpsc10 systemd[1]: Started Slurm node daemon.
>
>
>
> Unfortunately, the node is still seen as down when I issue a 'sinfo':
>
> root at ecpsc10:~# sinfo
>
> PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
>
> Compute         up   infinite      2   idle ecpsc[11-12]
>
> Compute         up   infinite      1   down ecpsc10
>
> FastCompute*    up   infinite      2   idle ecpsf[10-11]
>
>
>
> When I look at the details for this node, I get the following:
>
> root at ecpsc10:~# scontrol show node ecpsc10
>
> NodeName=ecpsc10 Arch=x86_64 CoresPerSocket=8
>
>    CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=0.00
>
>    AvailableFeatures=(null)
>
>    ActiveFeatures=(null)
>
>    Gres=(null)
>
>    NodeAddr=ecpsc10 NodeHostName=ecpsc10 Version=17.11
>
>    OS=Linux 5.8.0-43-generic #49~20.04.1-Ubuntu SMP Fri Feb 5 09:57:56 UTC
> 2021
>
>    RealMemory=40195 AllocMem=0 FreeMem=4585 Sockets=2 Boards=1
>
>    State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>
>    Partitions=Compute
>
>    BootTime=2021-10-25T14:16:35 SlurmdStartTime=2021-11-16T15:58:01
>
>    CfgTRES=cpu=16,mem=40195M,billing=16
>
>    AllocTRES=
>
>    CapWatts=n/a
>
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>    Reason=Node unexpectedly rebooted [slurm at 2021-11-16T14:41:04]
>
>
>
>
>
> From the Reason field, I gather that the node won't return to service because
> the machine was unexpectedly rebooted.
>
> However, the /etc/slurm/slurm.conf looks like:
>
>
>
> root at ecpsc10:~# cat /etc/slurm/slurm.conf | grep -i returntoservice
>
> ReturnToService=2
>
>
>
>
>
> So I'm quite puzzled as to why the node will not come back online.
>
>
>
> Any help would be greatly appreciated.
>
>
>
> Best,
>
>
>
> Emmanuel
>
>