[slurm-users] Unable to start slurmd service
Jaep Emmanuel
emmanuel.jaep at epfl.ch
Tue Nov 16 15:07:34 UTC 2021
Hi,
This might be a newbie question, since I'm new to Slurm.
I'm trying to restart the slurmd service on one of our Ubuntu boxes.
The slurmd.service is defined by:
[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity
[Install]
WantedBy=multi-user.target
The service starts without issue (systemctl start slurmd.service).
However, when checking the status of the service, I get a couple of error messages, but nothing alarming:
~# systemctl status slurmd.service
● slurmd.service - Slurm node daemon
Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2021-11-16 15:58:01 CET; 50s ago
Process: 2713019 ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 2713021 (slurmd)
Tasks: 1 (limit: 134845)
Memory: 1.9M
CGroup: /system.slice/slurmd.service
└─2713021 /usr/sbin/slurmd -d /usr/sbin/slurmstepd
Nov 16 15:58:01 ecpsc10 systemd[1]: Starting Slurm node daemon...
Nov 16 15:58:01 ecpsc10 systemd[1]: slurmd.service: Can't open PID file /run/slurmd.pid (yet?) after start: Operation not pe>
Nov 16 15:58:01 ecpsc10 systemd[1]: Started Slurm node daemon.
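The PID file warning may only be a timing artifact, but it can also show up when the PIDFile= path in the unit differs from the path slurmd actually writes (the SlurmdPidFile setting in slurm.conf). A minimal sketch for comparing the two, assuming the file locations shown above; the helper name conf_value is made up for illustration:

```shell
# Print the value of the last "Key=Value" line for key $1, read from stdin.
# The key is matched case-insensitively and must start the line.
conf_value() {
    grep -i "^[[:space:]]*$1[[:space:]]*=" | tail -n 1 | sed 's/^[^=]*=[[:space:]]*//'
}

UNIT=/etc/systemd/system/slurmd.service   # path taken from the status output
CONF=/etc/slurm/slurm.conf                # path taken from ConditionPathExists

# Only compare when both files are readable on this machine.
if [ -r "$UNIT" ] && [ -r "$CONF" ]; then
    echo "unit PIDFile:        $(conf_value PIDFile < "$UNIT")"
    echo "slurm SlurmdPidFile: $(conf_value SlurmdPidFile < "$CONF")"
fi
```

An empty second value would only mean that slurm.conf does not set SlurmdPidFile explicitly, in which case slurmd falls back to its built-in default.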
Unfortunately, the node is still reported as down when I issue an 'sinfo':
root@ecpsc10:~# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
Compute up infinite 2 idle ecpsc[11-12]
Compute up infinite 1 down ecpsc10
FastCompute* up infinite 2 idle ecpsf[10-11]
When I query this node, I get the following details:
root@ecpsc10:~# scontrol show node ecpsc10
NodeName=ecpsc10 Arch=x86_64 CoresPerSocket=8
CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=ecpsc10 NodeHostName=ecpsc10 Version=17.11
OS=Linux 5.8.0-43-generic #49~20.04.1-Ubuntu SMP Fri Feb 5 09:57:56 UTC 2021
RealMemory=40195 AllocMem=0 FreeMem=4585 Sockets=2 Boards=1
State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=Compute
BootTime=2021-10-25T14:16:35 SlurmdStartTime=2021-11-16T15:58:01
CfgTRES=cpu=16,mem=40195M,billing=16
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Node unexpectedly rebooted [slurm@2021-11-16T14:41:04]
From the reason, I understand that the node is being kept down because the machine was rebooted.
However, /etc/slurm/slurm.conf contains:
root@ecpsc10:~# cat /etc/slurm/slurm.conf | grep -i returntoservice
ReturnToService=2
So I'm quite puzzled as to why the node will not come back online.
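One way to narrow this down, sketched under assumptions: ReturnToService is acted on by the controller (slurmctld), so the value the running controller has loaded matters more than the copy of slurm.conf on the compute node. The snippet below prints the controller's value and the node's State and Reason fields; the helper name state_and_reason is hypothetical, and the final scontrol update line is left commented out because it changes cluster state.

```shell
NODE=ecpsc10   # node name taken from the sinfo output above

# Keep only the State= and Reason= fields of `scontrol show node` output (stdin).
state_and_reason() {
    grep -Eo '(^|[[:space:]])(State=[^[:space:]]+|Reason=.*)' | sed 's/^[[:space:]]*//'
}

# Skip everything on machines where the Slurm client tools are not installed.
if command -v scontrol >/dev/null 2>&1; then
    # ReturnToService as loaded by the running controller.
    scontrol show config | grep -i ReturnToService
    # Current state and the recorded reason for the node.
    scontrol show node "$NODE" | state_and_reason
    # Manual fallback (as root or SlurmUser): return the node to service.
    # scontrol update NodeName="$NODE" State=RESUME
fi
```

If the controller reports a different ReturnToService value than the local file, the two configuration files are probably out of sync.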
Any help will be greatly appreciated.
Best,
Emmanuel