[slurm-users] Unable to start slurmd service

Jaep Emmanuel emmanuel.jaep at epfl.ch
Tue Nov 16 16:50:50 UTC 2021


Thanks for the quick reply.

Check if munge is working properly

root at ecpsinf01:~# munge -n | ssh ecpsc10 unmunge
Warning: the ECDSA host key for 'ecpsc10' differs from the key for the IP address '128.178.242.136'
Offending key for IP in /root/.ssh/known_hosts:5
Matching host key in /root/.ssh/known_hosts:28
Are you sure you want to continue connecting (yes/no)? yes
STATUS:           Success (0)
ENCODE_HOST:      ecpsc10 (127.0.1.1)
ENCODE_TIME:      2021-11-16 16:57:56 +0100 (1637078276)
DECODE_TIME:      2021-11-16 16:58:10 +0100 (1637078290)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0
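
(Side note: when an unmunge check like this fails, the usual culprits are a munge key mismatch or clock skew between the nodes; a quick sanity check, assuming the default key location, is:)

# compare the munge key on the controller and the compute node (default path assumed)
md5sum /etc/munge/munge.key
ssh ecpsc10 md5sum /etc/munge/munge.key

# munge credentials are time-limited, so also make sure the clocks agree
date; ssh ecpsc10 date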

Check if SELinux is enforced

controller node
root at ecpsinf01:~# getenforce
-bash: getenforce: command not found
root at ecpsinf01:~# sestatus
-bash: sestatus: command not found

compute node
root at ecpsc10:~# getenforce

Command 'getenforce' not found, but can be installed with:

apt install selinux-utils

root at ecpsc10:~# sestatus

Command 'sestatus' not found, but can be installed with:

apt install policycoreutils
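
(Since neither tool is even installed here, SELinux is almost certainly not involved; to confirm without installing anything, checking for the selinuxfs mount is enough, just as a sketch:)

# /sys/fs/selinux only exists when an SELinux policy is loaded
test -d /sys/fs/selinux && echo "SELinux loaded" || echo "no SELinux"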

Check slurm log file
[2021-11-16T16:19:54.646] debug:  Log file re-opened
[2021-11-16T16:19:54.666] Message aggregation disabled
[2021-11-16T16:19:54.666] topology NONE plugin loaded
[2021-11-16T16:19:54.666] route default plugin loaded
[2021-11-16T16:19:54.667] CPU frequency setting not configured for this node
[2021-11-16T16:19:54.667] debug:  Resource spec: No specialized cores configured by default on this node
[2021-11-16T16:19:54.667] debug:  Resource spec: Reserved system memory limit not configured for this node
[2021-11-16T16:19:54.667] debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
[2021-11-16T16:19:54.667] debug:  Ignoring obsolete CgroupReleaseAgentDir option.
[2021-11-16T16:19:54.669] debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
[2021-11-16T16:19:54.670] debug:  Ignoring obsolete CgroupReleaseAgentDir option.
[2021-11-16T16:19:54.670] debug:  task/cgroup: now constraining jobs allocated cores
[2021-11-16T16:19:54.670] debug:  task/cgroup/memory: total:112428M allowed:100%(enforced), swap:0%(permissive), max:100%(112428M) max+swap:100%(224856M) min:30M kmem:100%(112428M enforced) min:30M swappiness:0(unset)
[2021-11-16T16:19:54.670] debug:  task/cgroup: now constraining jobs allocated memory
[2021-11-16T16:19:54.670] debug:  task/cgroup: now constraining jobs allocated devices
[2021-11-16T16:19:54.670] debug:  task/cgroup: loaded
[2021-11-16T16:19:54.671] debug:  Munge authentication plugin loaded
[2021-11-16T16:19:54.671] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2021-11-16T16:19:54.671] Munge cryptographic signature plugin loaded
[2021-11-16T16:19:54.673] slurmd version 17.11.12 started
[2021-11-16T16:19:54.673] debug:  Job accounting gather cgroup plugin loaded
[2021-11-16T16:19:54.674] debug:  job_container none plugin loaded
[2021-11-16T16:19:54.674] debug:  switch NONE plugin loaded
[2021-11-16T16:19:54.674] slurmd started on Tue, 16 Nov 2021 16:19:54 +0100
[2021-11-16T16:19:54.675] CPUs=16 Boards=1 Sockets=2 Cores=8 Threads=1 Memory=112428 TmpDisk=224253 Uptime=1911799 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2021-11-16T16:19:54.675] debug:  AcctGatherEnergy NONE plugin loaded
[2021-11-16T16:19:54.675] debug:  AcctGatherProfile NONE plugin loaded
[2021-11-16T16:19:54.675] debug:  AcctGatherInterconnect NONE plugin loaded
[2021-11-16T16:19:54.676] debug:  AcctGatherFilesystem NONE plugin loaded
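
(If more detail is ever needed than this debug log gives, slurmd can be stopped and run in the foreground with extra verbosity; -D and repeated -v are standard slurmd flags, shown here only as a sketch:)

# run slurmd in the foreground with verbose logging
systemctl stop slurmd
slurmd -D -vvvv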

Check if firewalld is enabled
No
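
(For completeness: on Ubuntu the host firewall is usually ufw rather than firewalld, and Slurm defaults to port 6817 for slurmctld and 6818 for slurmd; a quick sketch to rule out filtering, assuming those defaults:)

# confirm no host firewall rules are active
ufw status
iptables -L -n

# from the compute node, check the controller's slurmctld port; from the controller, the node's slurmd port
nc -zv ecpsinf01 6817
nc -zv ecpsc10 6818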


From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Hadrian Djohari <hxd58 at case.edu>
Reply to: Slurm User Community List <slurm-users at lists.schedmd.com>
Date: Tuesday, 16 November 2021 at 16:56
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Unable to start slurmd service

There can be a few possibilities:

  1.  Check if munge is working properly. From the scheduler master run "munge -n | ssh ecpsc10 unmunge"
  2.  Check if selinux is enforced
  3.  Check if firewalld or similar firewall is enabled
  4.  Check the logs: /var/log/slurm/slurmctld.log on the controller, or slurmd.log on the compute node
Best,

On Tue, Nov 16, 2021 at 10:12 AM Jaep Emmanuel <emmanuel.jaep at epfl.ch> wrote:
Hi,

It might be a newbie question since I'm new to Slurm.
I'm trying to restart the slurmd service on one of our Ubuntu boxes.

The slurmd.service is defined by:

[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity

[Install]
WantedBy=multi-user.target
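
(A quick sanity check on the unit itself, only as a sketch: confirm systemd is loading this file and that it parses cleanly.)

# show which unit file systemd actually uses and flag obvious problems in it
systemctl cat slurmd.service
systemd-analyze verify slurmd.service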


The service starts without issue (systemctl start slurmd.service).
However, when checking the status of the service, I get a couple of error messages, but nothing alarming:

~# systemctl status slurmd.service
● slurmd.service - Slurm node daemon
     Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2021-11-16 15:58:01 CET; 50s ago
    Process: 2713019 ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 2713021 (slurmd)
      Tasks: 1 (limit: 134845)
     Memory: 1.9M
     CGroup: /system.slice/slurmd.service
             └─2713021 /usr/sbin/slurmd -d /usr/sbin/slurmstepd

Nov 16 15:58:01 ecpsc10 systemd[1]: Starting Slurm node daemon...
Nov 16 15:58:01 ecpsc10 systemd[1]: slurmd.service: Can't open PID file /run/slurmd.pid (yet?) after start: Operation not pe>
Nov 16 15:58:01 ecpsc10 systemd[1]: Started Slurm node daemon.
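
(The "Can't open PID file" warning is usually harmless, it just means systemd looked before slurmd had written the file, but it is worth confirming that PIDFile= in the unit matches what slurm.conf tells slurmd to use; the paths below are the defaults and this is only a sketch:)

# where slurm.conf tells slurmd to write its PID file
grep -i SlurmdPidFile /etc/slurm/slurm.conf

# confirm the file exists after a start and matches PIDFile=/var/run/slurmd.pid from the unit
ls -l /run/slurmd.pid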

Unfortunately, the node is still seen as down when I issue 'sinfo':
root at ecpsc10:~# sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
Compute         up   infinite      2   idle ecpsc[11-12]
Compute         up   infinite      1   down ecpsc10
FastCompute*    up   infinite      2   idle ecpsf[10-11]
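
(sinfo can also print the recorded reason directly; -R is a standard sinfo option:)

# list down/drained nodes together with the reason slurmctld recorded
sinfo -R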

When I look at the details of this node, I get the following:
root at ecpsc10:~# scontrol show node ecpsc10
NodeName=ecpsc10 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=ecpsc10 NodeHostName=ecpsc10 Version=17.11
   OS=Linux 5.8.0-43-generic #49~20.04.1-Ubuntu SMP Fri Feb 5 09:57:56 UTC 2021
   RealMemory=40195 AllocMem=0 FreeMem=4585 Sockets=2 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=Compute
   BootTime=2021-10-25T14:16:35 SlurmdStartTime=2021-11-16T15:58:01
   CfgTRES=cpu=16,mem=40195M,billing=16
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Node unexpectedly rebooted [slurm at 2021-11-16T14:41:04]


From the reason, I gather that the node was marked down because the machine unexpectedly rebooted.
However, /etc/slurm/slurm.conf contains:

root at ecpsc10:~# cat /etc/slurm/slurm.conf | grep -i returntoservice
ReturnToService=2


So I'm quite puzzled as to why the node will not go back online.
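
(In the meantime, the usual manual workaround, shown only as a sketch, would be to clear the DOWN state from the controller and let the node register again:)

# clear the DOWN state by hand; slurmctld should then accept the node's next registration
scontrol update NodeName=ecpsc10 State=RESUME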

Any help will be greatly appreciated.

Best,

Emmanuel


--
Hadrian Djohari
Manager of Research Computing Services, [U]Tech
Case Western Reserve University
(W): 216-368-0395
(M): 216-798-7490