[slurm-users] SLURM slurmctld error on Ubuntu20.04 starting through systemctl
Brian Andrus
toomuchit at gmail.com
Wed Mar 17 18:54:12 UTC 2021
I am guessing you aren't overly familiar with Linux/systemd since you
have the '&' at the end of your start command.
Be that as it may, you can see it is a permissions issue. Check
permissions on /run and ensure the slurmctld user is able to write there.
You can either change the slurmctld user to one that can write there or
change the permissions on the directory to allow the slurmctld user
write access.
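
A minimal sketch of that check, assuming GNU stat (as shipped with Ubuntu) and that the daemon runs as the user "slurm"; the helper names dir_perms and writable_by are made up for illustration:

```shell
# Print owner, group and octal mode of a directory (GNU stat syntax).
dir_perms() {
    stat -c '%U %G %a' "$1"
}

# Return 0 if USER can write to DIR. Needs sudo; run it as root.
# Usage: writable_by USER DIR
writable_by() {
    sudo -u "$1" test -w "$2"
}
```

For example, "dir_perms /run" shows who owns the directory, and "writable_by slurm /run && echo ok" tells you whether that user could create a PID file there. Substitute whatever SlurmUser is set to in your slurm.conf.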
Brian Andrus
On 3/17/2021 11:16 AM, Sven Duscha wrote:
> Hi,
>
> I am getting an error from SLURM's slurmctld on Ubuntu 20.04 when
> starting the service through systemctl:
>
>
> I installed munge and SLURM version 19.05.5-1 through the package
> manager from the default repository:
>
> apt-get install munge slurm-client slurm-wlm slurm-wlm-doc slurmctld slurmd
>
>
> systemctl start slurmctld &
> [1] 2735
> 18:55 [root@slurm ~]# systemctl status slurmctld
> ● slurmctld.service - Slurm controller daemon
> Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;
> vendor preset: enabled)
> Active: activating (start) since Wed 2021-03-17 18:55:49 CET; 5s ago
> Docs: man:slurmctld(8)
> Process: 2737 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
> (code=exited, status=0/SUCCESS)
> Tasks: 12
> Memory: 2.5M
> CGroup: /system.slice/slurmctld.service
> └─2759 /usr/sbin/slurmctld
>
> Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
> Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file
> /run/slurmctld.pid (yet?) after start: Operation not permitted
>
>
>
>
> After about 60 seconds slurmctld terminates:
>
>
> -- A stop job for unit slurmctld.service has finished.
> --
> -- The job identifier is 1043 and the job result is done.
> Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
> -- Subject: A start job for unit slurmctld.service has begun execution
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> --
> -- A start job for unit slurmctld.service has begun execution.
> --
> -- The job identifier is 1044.
> Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file
> /run/slurmctld.pid (yet?) after start: Operation not permitted
> Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: start operation
> timed out. Terminating.
> Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: Failed with result
> 'timeout'.
>
>
>
>
> My slurm.conf file lists custom PID file locations for slurmctld and slurmd:
> /etc/slurm-llnl/slurm.conf
>
> SlurmctldPidFile=/run/slurm-llnl/slurmctld.pid
> SlurmdPidFile=/run/slurm-llnl/slurmd.pid
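
The log above has systemd waiting for /run/slurmctld.pid, while slurm.conf tells slurmctld to write /run/slurm-llnl/slurmctld.pid. If the unit's PIDFile= and slurm.conf disagree, that may be what produces the start timeout. A rough sketch for comparing the two, assuming simple KEY=value lines with the exact capitalisation shown here; conf_value and same_pidfile are hypothetical helper names:

```shell
# Print the value of KEY from a KEY=value style file (last match wins).
# Usage: conf_value KEY FILE
conf_value() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" | tail -n 1
}

# Show both PID file settings; return 0 only if they are identical.
# Usage: same_pidfile SLURM_CONF UNIT_FILE
same_pidfile() {
    a=$(conf_value SlurmctldPidFile "$1")
    b=$(conf_value PIDFile "$2")
    echo "slurm.conf: $a"
    echo "unit file:  $b"
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

With the paths from this thread that would be "same_pidfile /etc/slurm-llnl/slurm.conf /lib/systemd/system/slurmctld.service".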
>
>
>
> Starting the slurmctld executable by hand works fine:
> /usr/sbin/slurmctld &
>
> pgrep slurmctld
> 2819
> [1]+ Done /usr/sbin/slurmctld
> pgrep slurmctld
> 2819
> squeue
> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
> sinfo -lNe
> Wed Mar 17 19:01:45 2021
> NODELIST  NODES  PARTITION  STATE     CPUS  S:C:T  MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
> ekgen1    1      cluster*   unknown*  16    2:8:1  480000  0         1       (null)    none
> ekgen2    1      cluster*   down*     16    2:8:1  250000  0         1       (null)    Not responding
> ekgen3    1      debian     unknown*  16    2:8:1  250000  0         1       (null)    none
> ekgen4    1      cluster*   unknown*  16    2:8:1  250000  0         1       (null)    none
> ekgen5    1      cluster*   unknown*  16    2:8:1  250000  0         1       (null)    none
> ekgen6    1      debian     unknown*  16    2:8:1  250000  0         1       (null)    none
> ekgen7    1      cluster*   unknown*  16    2:8:1  250000  0         1       (null)    none
> ekgen8    1      debian     down*     16    2:8:1  250000  0         1       (null)    Not responding
> ekgen9    1      cluster*   unknown*  16    2:8:1  192000  0         1       (null)    none
>
>
>
> I then tried to modify /lib/systemd/system/slurmd.service
>
> cp /lib/systemd/system/slurmd.service /lib/systemd/system/slurmd.service.orig
>
> changed
> PIDFile=/run/slurmd.pid
> to
> PIDFile=/run/slurm-llnl/slurmd.pid
>
> systemctl start slurmctld &
> [1] 1869
> pgrep slurm
> 1875
> squeue
> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
>
> After about 60 seconds:
>
> Job for slurmctld.service failed because a timeout was exceeded.
> See "systemctl status slurmctld.service" and "journalctl -xe" for details
>
>
> -- Subject: A start job for unit packagekit.service has finished successfully
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> --
> -- A start job for unit packagekit.service has finished successfully.
> --
> -- The job identifier is 586.
> Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: start operation
> timed out. Terminating.
> Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: Failed with result
> 'timeout'.
> -- Subject: Unit failed
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> --
> -- The unit slurmctld.service has entered the 'failed' state with result
> 'timeout'.
> Mar 17 18:28:08 slurm systemd[1]: Failed to start Slurm controller daemon.
> -- Subject: A start job for unit slurmctld.service has failed
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> --
> -- A start job for unit slurmctld.service has finished with a failure.
> --
> -- The job identifier is 511 and the job result is failed.
> Mar 17 18:31:18 slurm systemd[1]: Starting Slurm controller daemon...
> -- Subject: A start job for unit slurmctld.service has begun execution
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> --
> -- A start job for unit slurmctld.service has begun execution.
> --
> -- The job identifier is 662.
> Mar 17 18:31:18 slurm systemd[1]: slurmctld.service: Can't open PID file
> /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: start operation
> timed out. Terminating.
> Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: Failed with result
> 'timeout'.
> -- Subject: Unit failed
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
>
>
>
> mkdir /run/slurm-lnll/
> chown slurm: /run/slurm-lnll/
>
> ls -lthrd /run/slurm-lnll/
> drwxr-xr-x 2 slurm slurm 40 Mar 17 18:34 /run/slurm-lnll/
>
> It doesn't create the PID file
>
> ls -lthr /run/slurm-lnll/
> total 0
>
>
> A work-around, writing the PID manually to the PID file, does work:
>
> systemctl start slurmctld & sleep 10; echo `pgrep slurmctld` > \
>   /run/slurm-lnll/slurmctld.pid && chown slurm: \
>   /run/slurm-lnll/slurmctld.pid && cat /run/slurm-lnll/slurmctld.pid
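
Rather than writing the PID by hand, it is probably cleaner to point the unit at the file slurmctld really writes. Note that the directory in these commands and in the log is spelled /run/slurm-lnll/, while your slurm.conf says /run/slurm-llnl/, and that the unit you edited earlier was slurmd.service, not slurmctld.service. A sketch of a drop-in override, assuming the usual /etc/systemd/system/UNIT.d/ location; write_pidfile_override is a hypothetical helper name:

```shell
# Write a drop-in that overrides only PIDFile= for a unit.
# Usage: write_pidfile_override DROPIN_DIR PIDFILE
write_pidfile_override() {
    mkdir -p "$1" &&
    printf '[Service]\nPIDFile=%s\n' "$2" > "$1/override.conf"
}

# Example, as root (paths taken from this thread):
#   write_pidfile_override /etc/systemd/system/slurmctld.service.d \
#       /run/slurm-llnl/slurmctld.pid
#   systemctl daemon-reload
#   systemctl restart slurmctld
```

A drop-in leaves the packaged file under /lib/systemd/system untouched, so a package upgrade should not revert it. systemd only picks up unit changes after "systemctl daemon-reload".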
>
>
> The status output still reports the problem:
>
> systemctl status slurmctld
> ● slurmctld.service - Slurm controller daemon
> Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;
> vendor preset: enabled)
> Active: active (running) since Wed 2021-03-17 18:37:28 CET; 1min 4s ago
> Docs: man:slurmctld(8)
> Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
> (code=exited, status=0/SUCCESS)
> Main PID: 2287 (slurmctld)
> Tasks: 7
> Memory: 2.3M
> CGroup: /system.slice/slurmctld.service
> └─2287 /usr/sbin/slurmctld
>
> Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...
> Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file
> /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.
>
>
> But the slurmctld process doesn't crash anymore. Stopping the service
> does work:
>
>
> systemctl stop slurmctld.service
> systemctl status slurmctld
> ● slurmctld.service - Slurm controller daemon
> Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;
> vendor preset: enabled)
> Active: inactive (dead) since Wed 2021-03-17 18:50:47 CET; 1s ago
> Docs: man:slurmctld(8)
> Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
> (code=exited, status=0/SUCCESS)
> Main PID: 2287 (code=exited, status=0/SUCCESS)
>
> Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...
> Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file
> /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.
> Mar 17 18:50:47 slurm systemd[1]: Stopping Slurm controller daemon...
> Mar 17 18:50:47 slurm systemd[1]: slurmctld.service: Succeeded.
> Mar 17 18:50:47 slurm systemd[1]: Stopped Slurm controller daemon.
>
>
>
> I am a little astonished that slurmctld shows this strange behaviour
> when installed from the default package through the package manager.
>
> The base system is an Ubuntu 20.04 server installation, to which I made
> no modifications apart from installing the SLURM-wlm packages and
> importing my existing configuration and munge.key.
>
>
> Best wishes,
>
> Sven Duscha
>
>