[slurm-users] SLURM slurmctld error on Ubuntu20.04 starting through systemctl

Stefan Staeglich staeglis at informatik.uni-freiburg.de
Thu Mar 18 10:29:29 UTC 2021


Hi Sven,

I think it makes more sense to adjust the config file /etc/slurm-llnl/slurm.conf
and not the systemd units, so that the PID file paths match the ones the
packaged units expect:
SlurmctldPidFile=/run/slurmctld.pid
SlurmdPidFile=/run/slurmd.pid
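
To check that slurm.conf and the unit file agree, something like the
following could help. It is only a sketch: the function names conf_value and
check_pidfile are made up for illustration, and it assumes plain Key=Value
lines with exactly this capitalization (slurm.conf parameter names are case
insensitive, so a differently capitalized key would be missed).

```shell
#!/bin/sh
# Sketch: compare the PID file slurmctld writes (slurm.conf) with the one
# systemd waits for (PIDFile= in the unit). Function names are hypothetical.

# Print the value of the last uncommented "Key=Value" line for key $1 in
# file $2, with surrounding whitespace stripped.
conf_value() {
    sed -n "s/^[[:space:]]*${1}[[:space:]]*=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*\$/\1/p" "$2" | tail -n 1
}

# Compare SlurmctldPidFile in slurm.conf ($1) with PIDFile in the unit ($2).
# Returns 0 on match, 1 on mismatch, 2 if either setting is missing.
check_pidfile() {
    conf_pid=$(conf_value SlurmctldPidFile "$1")
    unit_pid=$(conf_value PIDFile "$2")
    if [ -z "$conf_pid" ] || [ -z "$unit_pid" ]; then
        echo "missing: slurm.conf='$conf_pid' unit='$unit_pid'"
        return 2
    fi
    if [ "$conf_pid" = "$unit_pid" ]; then
        echo "match: $conf_pid"
    else
        echo "mismatch: slurm.conf=$conf_pid unit=$unit_pid"
        return 1
    fi
}
```

With the paths from this thread it would be called as:
check_pidfile /etc/slurm-llnl/slurm.conf /lib/systemd/system/slurmctld.service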

Best,
Stefan

On Wednesday, 17 March 2021, 19:16:38 CET, Sven Duscha wrote:
> Hi,
> 
> I am getting an error from the SLURM slurmctld daemon on Ubuntu 20.04 when
> starting the service through systemctl.
> 
> 
> I installed munge and SLURM version 19.05.5-1 through the package manager
> from the default repository:
> 
> apt-get install munge slurm-client slurm-wlm slurm-wlm-doc slurmctld slurmd
> 
> 
> systemctl start slurmctld &
> [1] 2735
> 18:55 [root@slurm ~]# systemctl status slurmctld
> ● slurmctld.service - Slurm controller daemon
>     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
>     Active: activating (start) since Wed 2021-03-17 18:55:49 CET; 5s ago
>       Docs: man:slurmctld(8)
>    Process: 2737 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>      Tasks: 12
>     Memory: 2.5M
>     CGroup: /system.slice/slurmctld.service
>             └─2759 /usr/sbin/slurmctld
> 
> Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
> Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurmctld.pid (yet?) after start: Operation not permitted
> 
> 
> 
> 
> After about 90 seconds systemd times out and terminates slurmctld:
> 
> 
> -- A stop job for unit slurmctld.service has finished.
> --
> -- The job identifier is 1043 and the job result is done.
> Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
> -- Subject: A start job for unit slurmctld.service has begun execution
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> --
> -- A start job for unit slurmctld.service has begun execution.
> --
> -- The job identifier is 1044.
> Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurmctld.pid (yet?) after start: Operation not permitted
> Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
> Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.
> 
> 
> 
> 
> My slurm.conf (/etc/slurm-llnl/slurm.conf) sets custom PID file locations
> for slurmctld and slurmd:
> 
> SlurmctldPidFile=/run/slurm-llnl/slurmctld.pid
> SlurmdPidFile=/run/slurm-llnl/slurmd.pid
> 
> 
> 
> Starting the slurmctld executable by hand works fine:
> /usr/sbin/slurmctld &
> 
> pgrep slurmctld
> 2819
> [1]+  Done                    /usr/sbin/slurmctld
> pgrep slurmctld
> 2819
> squeue
>   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> sinfo -lNe
> Wed Mar 17 19:01:45 2021
> NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> ekgen1         1  cluster*    unknown*   16    2:8:1 480000        0      1   (null) none
> ekgen2         1  cluster*       down*   16    2:8:1 250000        0      1   (null) Not responding
> ekgen3         1    debian    unknown*   16    2:8:1 250000        0      1   (null) none
> ekgen4         1  cluster*    unknown*   16    2:8:1 250000        0      1   (null) none
> ekgen5         1  cluster*    unknown*   16    2:8:1 250000        0      1   (null) none
> ekgen6         1    debian    unknown*   16    2:8:1 250000        0      1   (null) none
> ekgen7         1  cluster*    unknown*   16    2:8:1 250000        0      1   (null) none
> ekgen8         1    debian       down*   16    2:8:1 250000        0      1   (null) Not responding
> ekgen9         1  cluster*    unknown*   16    2:8:1 192000        0      1   (null) none
> 
> 
> 
> I then tried modifying /lib/systemd/system/slurmd.service:
> 
> cp /lib/systemd/system/slurmd.service /lib/systemd/system/slurmd.service.orig
> 
> changed
> PIDFile=/run/slurmd.pid
> to
> PIDFile=/run/slurm-llnl/slurmd.pid
> 
> systemctl start slurmctld &
> [1] 1869
> pgrep slurm
> 1875
> squeue
>   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> 
> After about 90 seconds:
> 
> Job for slurmctld.service failed because a timeout was exceeded.
> See "systemctl status slurmctld.service" and "journalctl -xe" for details
> 
> 
> -- Subject: A start job for unit packagekit.service has finished successfully
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> --
> -- A start job for unit packagekit.service has finished successfully.
> --
> -- The job identifier is 586.
> Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
> Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.
> -- Subject: Unit failed
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> --
> -- The unit slurmctld.service has entered the 'failed' state with result 'timeout'.
> Mar 17 18:28:08 slurm systemd[1]: Failed to start Slurm controller daemon.
> -- Subject: A start job for unit slurmctld.service has failed
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> --
> -- A start job for unit slurmctld.service has finished with a failure.
> --
> -- The job identifier is 511 and the job result is failed.
> Mar 17 18:31:18 slurm systemd[1]: Starting Slurm controller daemon...
> -- Subject: A start job for unit slurmctld.service has begun execution
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> --
> -- A start job for unit slurmctld.service has begun execution.
> --
> -- The job identifier is 662.
> Mar 17 18:31:18 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
> Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.
> -- Subject: Unit failed
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> 
> 
> 
> mkdir /run/slurm-lnll/
> chown slurm: /run/slurm-lnll/
> 
> ls -lthrd /run/slurm-lnll/
> drwxr-xr-x 2 slurm slurm 40 Mar 17 18:34 /run/slurm-lnll/
> 
> slurmctld doesn't create the PID file there:
> 
> ls -lthr /run/slurm-lnll/
> total 0
> 
> 
> A work-around, writing the PID manually to the PID file, does work:
> 
> systemctl start slurmctld & sleep 10; \
>   echo $(pgrep slurmctld) > /run/slurm-lnll/slurmctld.pid && \
>   chown slurm: /run/slurm-lnll/slurmctld.pid && \
>   cat /run/slurm-lnll/slurmctld.pid
> 
> 
> The status output still reports the PID file problem:
> 
> systemctl status slurmctld
> ● slurmctld.service - Slurm controller daemon
>     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
>     Active: active (running) since Wed 2021-03-17 18:37:28 CET; 1min 4s ago
>       Docs: man:slurmctld(8)
>    Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>   Main PID: 2287 (slurmctld)
>      Tasks: 7
>     Memory: 2.3M
>     CGroup: /system.slice/slurmctld.service
>             └─2287 /usr/sbin/slurmctld
> 
> Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...
> Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.
> 
> 
> But systemd no longer terminates the slurmctld process, and stopping the
> service works:
> 
> 
> systemctl stop slurmctld.service
> systemctl status slurmctld
> ● slurmctld.service - Slurm controller daemon
>      Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
>      Active: inactive (dead) since Wed 2021-03-17 18:50:47 CET; 1s ago
>        Docs: man:slurmctld(8)
>     Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>    Main PID: 2287 (code=exited, status=0/SUCCESS)
> 
> Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...
> Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.
> Mar 17 18:50:47 slurm systemd[1]: Stopping Slurm controller daemon...
> Mar 17 18:50:47 slurm systemd[1]: slurmctld.service: Succeeded.
> Mar 17 18:50:47 slurm systemd[1]: Stopped Slurm controller daemon.
> 
> 
> 
> I am a little surprised that the stock slurmctld package from the package
> manager behaves this way.
> 
> The base system is a stock Ubuntu 20.04 server installation; I made no
> modifications apart from installing the SLURM-wlm packages and importing
> my existing configuration and munge.key.
> 
> 
> Best wishes,
> 
> Sven Duscha


-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.52,   79110 Freiburg,    Germany

E-Mail : staeglis at informatik.uni-freiburg.de
WWW    : gki.informatik.uni-freiburg.de
Phone  : +49 761 203-54216
Fax    : +49 761 203-8222
