[slurm-users] SLURM slurmctld error on Ubuntu 20.04 starting through systemctl

Rodrigo Santibáñez rsantibanez.uchile at gmail.com
Wed Mar 17 19:36:13 UTC 2021


After installing SLURM in Ubuntu and before starting the services, I do:

mkdir -p /var/spool/slurmd
mkdir -p /var/lib/slurm-llnl
mkdir -p /var/lib/slurm-llnl/slurmd
mkdir -p /var/lib/slurm-llnl/slurmctld
mkdir -p /var/run/slurm-llnl   # change this to /run/slurm-llnl if that is the location in your SlurmdPidFile and SlurmctldPidFile settings
mkdir -p /var/log/slurm-llnl

chmod -R 755 /var/spool/slurmd
chmod -R 755 /var/lib/slurm-llnl/
chmod -R 755 /var/run/slurm-llnl/   # or /run/slurm-llnl/, as above
chmod -R 755 /var/log/slurm-llnl/

chown -R slurm:slurm /var/spool/slurmd
chown -R slurm:slurm /var/lib/slurm-llnl/
chown -R slurm:slurm /var/run/slurm-llnl/   # or /run/slurm-llnl/, as above
chown -R slurm:slurm /var/log/slurm-llnl/
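
One caveat with the /run (and /var/run, which is a symlink to it) location: on Ubuntu 20.04 /run is a tmpfs, so anything created there by hand is gone after a reboot. A minimal tmpfiles.d sketch keeps the runtime directory around; the file name /etc/tmpfiles.d/slurm.conf is just an example, adjust path, mode and owner to your setup:

# /etc/tmpfiles.d/slurm.conf
# recreate the runtime directory at every boot, owned by the slurm user
d /run/slurm-llnl 0755 slurm slurm -

Running "systemd-tmpfiles --create /etc/tmpfiles.d/slurm.conf" applies it right away, without waiting for the next boot.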

Hope that clarifies something. My first SLURM installations failed because
of missing directories and wrong permissions.

Best!

On Wed, Mar 17, 2021 at 11:56, Brian Andrus (<toomuchit at gmail.com>)
wrote:

> I am guessing you aren't overly familiar with Linux/systemd since you
> have the '&' at the end of your start command.
>
> Be that as it may, you can see it is a permissions issue. Check
> permissions on /run and ensure the slurmctld user is able to write there.
>
> You can either change the slurmctld user to one that can write there or
> change the permissions on the directory to allow the slurmctld user
> write access.
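
A quick way to check this, as a sketch (it assumes the stock unit's PIDFile=/run/slurmctld.pid, which the log below confirms, and SlurmUser=slurm):

ls -ld /run                              # on a stock system: drwxr-xr-x, owned by root
id slurm                                 # the account that needs write access
sudo -u slurm touch /run/slurmctld.pid   # "Permission denied" means the PID file can never be written
rm -f /run/slurmctld.pid                 # clean up the test file if it was created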
>
> Brian Andrus
>
>
> On 3/17/2021 11:16 AM, Sven Duscha wrote:
> > Hi,
> >
> > I am experiencing an error with SLURM slurmctld on Ubuntu 20.04 when
> > starting the service through systemctl:
> >
> >
> > I installed munge and SLURM version 19.05.5-1 through the package
> > manager from
> > the default repository:
> >
> > apt-get install munge slurm-client slurm-wlm slurm-wlm-doc slurmctld slurmd
> >
> >
> > systemctl start slurmctld &
> > [1] 2735
> > 18:55 [root@slurm ~]# systemctl status slurmctld
> > ● slurmctld.service - Slurm controller daemon
> >      Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;
> > vendor preset: enabled)
> >      Active: activating (start) since Wed 2021-03-17 18:55:49 CET; 5s ago
> >        Docs: man:slurmctld(8)
> >     Process: 2737 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
> > (code=exited, status=0/SUCCESS)
> >       Tasks: 12
> >      Memory: 2.5M
> >      CGroup: /system.slice/slurmctld.service
> >              └─2759 /usr/sbin/slurmctld
> >
> > Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
> > Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file
> > /run/slurmctld.pid (yet?) after start: Operation not permitted
> >
> >
> >
> >
> > After about 90 seconds, systemd times out the start job and terminates slurmctld:
> >
> >
> > -- A stop job for unit slurmctld.service has finished.
> > --
> > -- The job identifier is 1043 and the job result is done.
> > Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
> > -- Subject: A start job for unit slurmctld.service has begun execution
> > -- Defined-By: systemd
> > -- Support: http://www.ubuntu.com/support
> > --
> > -- A start job for unit slurmctld.service has begun execution.
> > --
> > -- The job identifier is 1044.
> > Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file
> > /run/slurmctld.pid (yet?) after start: Operation not permitted
> > Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: start operation
> > timed out. Terminating.
> > Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: Failed with result
> > 'timeout'.
> >
> >
> >
> >
> > My slurm.conf file lists custom PID file locations for slurmctld and
> slurmd:
> > /etc/slurm-llnl/slurm.conf
> >
> > SlurmctldPidFile=/run/slurm-llnl/slurmctld.pid
> > SlurmdPidFile=/run/slurm-llnl/slurmd.pid
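
A sketch of a quick consistency check: systemd only looks at PIDFile= in the unit file, so it has to name the same path as these slurm.conf entries, otherwise systemd waits for a file the daemon never writes (the packaged unit watches /run/slurmctld.pid, as the log above shows):

grep -i pidfile /etc/slurm-llnl/slurm.conf
grep -i pidfile /lib/systemd/system/slurmctld.service /lib/systemd/system/slurmd.service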
> >
> >
> >
> > Starting the slurmctld executable by hand works fine:
> > /usr/sbin/slurmctld &
> >
> > pgrep slurmctld
> > 2819
> > [1]+  Done                    /usr/sbin/slurmctld
> > pgrep slurmctld
> > 2819
> > squeue
> >    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> > sinfo -lNe
> > Wed Mar 17 19:01:45 2021
> > NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> > ekgen1         1  cluster*    unknown*   16    2:8:1 480000        0      1   (null) none
> > ekgen2         1  cluster*       down*   16    2:8:1 250000        0      1   (null) Not responding
> > ekgen3         1    debian    unknown*   16    2:8:1 250000        0      1   (null) none
> > ekgen4         1  cluster*    unknown*   16    2:8:1 250000        0      1   (null) none
> > ekgen5         1  cluster*    unknown*   16    2:8:1 250000        0      1   (null) none
> > ekgen6         1    debian    unknown*   16    2:8:1 250000        0      1   (null) none
> > ekgen7         1  cluster*    unknown*   16    2:8:1 250000        0      1   (null) none
> > ekgen8         1    debian       down*   16    2:8:1 250000        0      1   (null) Not responding
> > ekgen9         1  cluster*    unknown*   16    2:8:1 192000        0      1   (null) none
> >
> >
> >
> > I then tried to modify /lib/systemd/system/slurmd.service
> >
> > cp /lib/systemd/system/slurmd.service /lib/systemd/system/slurmd.service.orig
> >
> > changed
> > PIDFile=/run/slurmd.pid
> > to
> > PIDFile=/run/slurm-llnl/slurmd.pid
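
Two side notes on this step, as a sketch rather than a definitive fix: changes under /lib/systemd/system only take effect after "systemctl daemon-reload" (and get overwritten on package upgrades), and the file edited here is slurmd.service while the unit that times out is slurmctld.service. A drop-in override for the controller, matching the SlurmctldPidFile path from slurm.conf, could look like:

systemctl edit slurmctld.service
# in the editor that opens, add:
#   [Service]
#   PIDFile=/run/slurm-llnl/slurmctld.pid
#   RuntimeDirectory=slurm-llnl
systemctl daemon-reload
systemctl restart slurmctld.service

RuntimeDirectory=slurm-llnl asks systemd to create /run/slurm-llnl at service start; its ownership follows the unit's User=/Group= settings, so double-check that the directory ends up writable by your SlurmUser.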
> >
> > systemctl start slurmctld &
> > [1] 1869
> > pgrep slurm
> > 1875
> > squeue
> >    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> >
> > after ca. 60 seconds:
> >
> > Job for slurmctld.service failed because a timeout was exceeded.
> > See "systemctl status slurmctld.service" and "journalctl -xe" for details
> >
> >
> > - Subject: A start job for unit packagekit.service has finished
> successfully
> > -- Defined-By: systemd
> > -- Support: http://www.ubuntu.com/support
> > --
> > -- A start job for unit packagekit.service has finished successfully.
> > --
> > -- The job identifier is 586.
> > Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: start operation
> > timed out. Terminating.
> > Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: Failed with result
> > 'timeout'.
> > -- Subject: Unit failed
> > -- Defined-By: systemd
> > -- Support: http://www.ubuntu.com/support
> > --
> > -- The unit slurmctld.service has entered the 'failed' state with result
> > 'timeout'.
> > Mar 17 18:28:08 slurm systemd[1]: Failed to start Slurm controller
> daemon.
> > -- Subject: A start job for unit slurmctld.service has failed
> > -- Defined-By: systemd
> > -- Support: http://www.ubuntu.com/support
> > --
> > -- A start job for unit slurmctld.service has finished with a failure.
> > --
> > -- The job identifier is 511 and the job result is failed.
> > Mar 17 18:31:18 slurm systemd[1]: Starting Slurm controller daemon...
> > -- Subject: A start job for unit slurmctld.service has begun execution
> > -- Defined-By: systemd
> > -- Support: http://www.ubuntu.com/support
> > --
> > -- A start job for unit slurmctld.service has begun execution.
> > --
> > -- The job identifier is 662.
> > Mar 17 18:31:18 slurm systemd[1]: slurmctld.service: Can't open PID file
> > /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> > Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: start operation
> > timed out. Terminating.
> > Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: Failed with result
> > 'timeout'.
> > -- Subject: Unit failed
> > -- Defined-By: systemd
> > -- Support: http://www.ubuntu.com/support
> >
> >
> >
> > mkdir /run/slurm-lnll/
> > chown slurm: /run/slurm-lnll/
> >
> > ls -lthrd /run/slurm-lnll/
> > drwxr-xr-x 2 slurm slurm 40 Mar 17 18:34 /run/slurm-lnll/
> >
> > But slurmctld doesn't create the PID file there:
> >
> > ls -lthr /run/slurm-lnll/
> > total 0
> >
> >
> > A work-around, writing the PID manually to the PID file, does work:
> >
> > systemctl start slurmctld & sleep 10; echo `pgrep slurmctld` >
> > /run/slurm-lnll/slurmctld.pid && chown slurm:
> > /run/slurm-lnll/slurmctld.pid && cat /run/slurm-lnll/slurmctld.pid
> >
> >
> > The status still reports the PID file problem:
> >
> > systemctl status slurmctld
> > ● slurmctld.service - Slurm controller daemon
> >      Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;
> > vendor preset: enabled)
> >      Active: active (running) since Wed 2021-03-17 18:37:28 CET; 1min 4s
> ago
> >        Docs: man:slurmctld(8)
> >     Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
> > (code=exited, status=0/SUCCESS)
> >    Main PID: 2287 (slurmctld)
> >       Tasks: 7
> >      Memory: 2.3M
> >      CGroup: /system.slice/slurmctld.service
> >              └─2287 /usr/sbin/slurmctld
> >
> > Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...
> > Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file
> > /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> > Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.
> >
> >
> > But the slurmctld process doesn't crash anymore. Stopping the service
> > does work:
> >
> >
> > systemctl stop slurmctld.service
> > systemctl status slurmctld
> > ● slurmctld.service - Slurm controller daemon
> >       Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;
> > vendor preset: enabled)
> >       Active: inactive (dead) since Wed 2021-03-17 18:50:47 CET; 1s ago
> >         Docs: man:slurmctld(8)
> >      Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
> > (code=exited, status=0/SUCCESS)
> >     Main PID: 2287 (code=exited, status=0/SUCCESS)
> >
> > Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...
> > Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file
> > /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> > Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.
> > Mar 17 18:50:47 slurm systemd[1]: Stopping Slurm controller daemon...
> > Mar 17 18:50:47 slurm systemd[1]: slurmctld.service: Succeeded.
> > Mar 17 18:50:47 slurm systemd[1]: Stopped Slurm controller daemon.
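
For what it is worth, a different way to sidestep the PID file handling entirely (a sketch, not the packaged setup) is to run the daemon in the foreground and let systemd track the process directly:

systemctl edit slurmctld.service
# in the editor that opens, add:
#   [Service]
#   Type=simple
#   PIDFile=
#   ExecStart=
#   ExecStart=/usr/sbin/slurmctld -D $SLURMCTLD_OPTIONS

With Type=simple systemd does not wait for a PID file at all; the empty PIDFile= and ExecStart= lines clear the values inherited from the packaged unit before the new ExecStart is set.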
> >
> >
> >
> > I am a little astonished that slurmctld installed through the package
> > manager from the default repository shows this strange behaviour.
> >
> > The base installation is Ubuntu 20.04 server installation, where I did
> > no modifications apart from installing the SLURM-wlm packages and
> > importing my existing configuration and munge.key.
> >
> >
> > Best wishes,
> >
> > Sven Duscha
> >
> >
>
>