[slurm-users] SLURM slurmctld error on Ubuntu20.04 starting through systemctl

Sven Duscha sven.duscha at tum.de
Wed Mar 17 18:16:38 UTC 2021


Hi,

I get an error from SLURM's slurmctld on Ubuntu 20.04 when starting
the service through systemctl.


I installed munge and SLURM version 19.05.5-1 through the package
manager from the default repository:

apt-get install munge slurm-client slurm-wlm slurm-wlm-doc slurmctld slurmd


systemctl start slurmctld &
[1] 2735
18:55 [root@slurm ~]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
    Active: activating (start) since Wed 2021-03-17 18:55:49 CET; 5s ago
      Docs: man:slurmctld(8)
   Process: 2737 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
     Tasks: 12
    Memory: 2.5M
    CGroup: /system.slice/slurmctld.service
            └─2759 /usr/sbin/slurmctld

Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurmctld.pid (yet?) after start: Operation not permitted




After about 90 seconds (18:55:49 to 18:57:19 in the journal below)
systemd times out the start job and terminates slurmctld:


-- A stop job for unit slurmctld.service has finished.
--
-- The job identifier is 1043 and the job result is done.
Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
-- Subject: A start job for unit slurmctld.service has begun execution
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit slurmctld.service has begun execution.
--
-- The job identifier is 1044.
Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurmctld.pid (yet?) after start: Operation not permitted
Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.




My slurm.conf (/etc/slurm-llnl/slurm.conf) sets custom PID file
locations for slurmctld and slurmd:

SlurmctldPidFile=/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/run/slurm-llnl/slurmd.pid
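
The journal above shows systemd waiting for /run/slurmctld.pid, while
slurm.conf tells slurmctld to write its PID file under /run/slurm-llnl/.
A unit with PIDFile= waits for exactly the file it names, so the two
paths have to agree. Below is a minimal sketch for comparing them. The
function names are made up for illustration, and the parsing is
deliberately naive (it ignores trailing comments and Include lines).

```shell
#!/bin/sh
# Print the PID file path set in a systemd unit file (last PIDFile= wins).
unit_pidfile() {
    sed -n 's/^[[:space:]]*PIDFile[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1
}

# Print the slurmctld PID file path from slurm.conf.
# slurm.conf parameter names are case-insensitive, hence grep -i.
conf_pidfile() {
    grep -i '^[[:space:]]*SlurmctldPidFile[[:space:]]*=' "$1" \
        | tail -n 1 | sed 's/^[^=]*=[[:space:]]*//'
}

# Compare both paths; return 0 if they match, 1 otherwise.
check_pidfiles() {
    u=$(unit_pidfile "$1")
    c=$(conf_pidfile "$2")
    if [ "$u" = "$c" ]; then
        echo "match: $u"
    else
        echo "mismatch: unit=$u slurm.conf=$c"
        return 1
    fi
}

# Example:
#   check_pidfiles /lib/systemd/system/slurmctld.service /etc/slurm-llnl/slurm.conf
```

On a running system, `systemctl show -p PIDFile slurmctld` reports the
value systemd actually uses, including any drop-in overrides.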



Starting the slurmctld executable by hand works fine:
/usr/sbin/slurmctld &

pgrep slurmctld
2819
[1]+  Done                    /usr/sbin/slurmctld
pgrep slurmctld
2819
squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
sinfo -lNe
Wed Mar 17 19:01:45 2021
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
ekgen1         1  cluster*    unknown*   16    2:8:1 480000        0      1   (null) none
ekgen2         1  cluster*       down*   16    2:8:1 250000        0      1   (null) Not responding
ekgen3         1    debian    unknown*   16    2:8:1 250000        0      1   (null) none
ekgen4         1  cluster*    unknown*   16    2:8:1 250000        0      1   (null) none
ekgen5         1  cluster*    unknown*   16    2:8:1 250000        0      1   (null) none
ekgen6         1    debian    unknown*   16    2:8:1 250000        0      1   (null) none
ekgen7         1  cluster*    unknown*   16    2:8:1 250000        0      1   (null) none
ekgen8         1    debian       down*   16    2:8:1 250000        0      1   (null) Not responding
ekgen9         1  cluster*    unknown*   16    2:8:1 192000        0      1   (null) none



I then tried modifying /lib/systemd/system/slurmd.service. I first made
a backup:

cp /lib/systemd/system/slurmd.service /lib/systemd/system/slurmd.service.orig

and then changed
PIDFile=/run/slurmd.pid
to
PIDFile=/run/slurm-llnl/slurmd.pid
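
Files under /lib/systemd/system belong to the package and may be
overwritten on upgrade, so an alternative to editing them in place is a
drop-in under /etc/systemd/system. A minimal sketch follows. It targets
slurmctld.service on the assumption that this is the unit that fails to
start (the same approach would apply to slurmd.service). The function
name and the override.conf file name are illustrative choices.

```shell
#!/bin/sh
# Write a drop-in that overrides only PIDFile= for a unit.
# $1: drop-in directory, e.g. /etc/systemd/system/slurmctld.service.d
# $2: PID file path; it should equal SlurmctldPidFile in slurm.conf
write_pidfile_dropin() {
    dropin_dir=$1
    pidfile=$2
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/override.conf" <<EOF
[Service]
PIDFile=$pidfile
EOF
}

# Example (as root), followed by a reload so systemd reads the drop-in:
#   write_pidfile_dropin /etc/systemd/system/slurmctld.service.d \
#       /run/slurm-llnl/slurmctld.pid
#   systemctl daemon-reload
#   systemctl restart slurmctld
```

`systemctl edit slurmctld` creates the same kind of drop-in
interactively.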

systemctl start slurmctld &
[1] 1869
pgrep slurm
1875
squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

After about 90 seconds (going by the journal timestamps below), the
start job fails:

Job for slurmctld.service failed because a timeout was exceeded.
See "systemctl status slurmctld.service" and "journalctl -xe" for details


-- Subject: A start job for unit packagekit.service has finished successfully
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit packagekit.service has finished successfully.
--
-- The job identifier is 586.
Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- The unit slurmctld.service has entered the 'failed' state with result 'timeout'.
Mar 17 18:28:08 slurm systemd[1]: Failed to start Slurm controller daemon.
-- Subject: A start job for unit slurmctld.service has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit slurmctld.service has finished with a failure.
--
-- The job identifier is 511 and the job result is failed.
Mar 17 18:31:18 slurm systemd[1]: Starting Slurm controller daemon...
-- Subject: A start job for unit slurmctld.service has begun execution
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit slurmctld.service has begun execution.
--
-- The job identifier is 662.
Mar 17 18:31:18 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support



mkdir /run/slurm-lnll/
chown slurm: /run/slurm-lnll/

ls -lthrd /run/slurm-lnll/
drwxr-xr-x 2 slurm slurm 40 Mar 17 18:34 /run/slurm-lnll/

slurmctld still doesn't create the PID file there:

ls -lthr /run/slurm-lnll/
total 0
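
Since /run is a tmpfs, a directory created there by hand is gone after
a reboot. One way to have it recreated at boot is a tmpfiles.d entry.
The sketch below assumes the directory from slurm.conf quoted earlier
(/run/slurm-llnl, which is spelled differently from the directory in
the mkdir command above) and the slurm user and group. The function
name and the file name slurm-llnl.conf are illustrative.

```shell
#!/bin/sh
# Emit a tmpfiles.d line that creates a directory with mode 0755.
# $1: directory path, $2: owner, $3: group
# The trailing "-" leaves the age field unset, so nothing is cleaned up.
tmpfiles_line() {
    printf 'd %s 0755 %s %s -\n' "$1" "$2" "$3"
}

# Example (as root):
#   tmpfiles_line /run/slurm-llnl slurm slurm > /etc/tmpfiles.d/slurm-llnl.conf
#   systemd-tmpfiles --create /etc/tmpfiles.d/slurm-llnl.conf
```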


A work-around, writing the PID manually to the PID file, does work:

systemctl start slurmctld & sleep 10; echo `pgrep slurmctld` > \
/run/slurm-lnll/slurmctld.pid && chown slurm: \
/run/slurm-lnll/slurmctld.pid && cat /run/slurm-lnll/slurmctld.pid


systemctl status still reports the PID file problem:

systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
    Active: active (running) since Wed 2021-03-17 18:37:28 CET; 1min 4s ago
      Docs: man:slurmctld(8)
   Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
  Main PID: 2287 (slurmctld)
     Tasks: 7
    Memory: 2.3M
    CGroup: /system.slice/slurmctld.service
            └─2287 /usr/sbin/slurmctld

Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...
Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.


But slurmctld is no longer terminated after the start timeout, and
stopping the service works:


systemctl stop slurmctld.service
systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Wed 2021-03-17 18:50:47 CET; 1s ago
       Docs: man:slurmctld(8)
    Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 2287 (code=exited, status=0/SUCCESS)

Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...
Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.
Mar 17 18:50:47 slurm systemd[1]: Stopping Slurm controller daemon...
Mar 17 18:50:47 slurm systemd[1]: slurmctld.service: Succeeded.
Mar 17 18:50:47 slurm systemd[1]: Stopped Slurm controller daemon.



I am a little surprised that slurmctld behaves this way when installed
from the default packages through the package manager.

The base system is an Ubuntu 20.04 server installation, which I have
not modified apart from installing the slurm-wlm packages and
importing my existing configuration and munge.key.


Best wishes,

Sven Duscha


-- 
Sven Duscha
Deutsches Herzzentrum München
Technische Universität München
Lazarettstraße 36
80636 München
