[slurm-users] SLURM slurmctld error on Ubuntu20.04 starting through systemctl

Sven Duscha sven.duscha at tum.de
Thu Mar 18 11:49:12 UTC 2021


Hi,

thanks for all the responses.

On 18.03.21 11:29, Stefan Staeglich wrote:
> I think it makes more sense to adjust the config file /etc/slurm-llnl/slurm.conf 
> and not the systemd units:
> SlurmctldPidFile=/run/slurmctld.pid
> SlurmdPidFile=/run/slurmd.pid


That was of course my first approach. I had used the directory
/run/slurm-lnll/ on my CentOS 7 installations, from which I copied the
slurm.conf file over.

It turned out that the directories I defined there weren't used. The
error message suggested that slurmctld still tried to write to
/run/slurmctld.pid.

Changing the systemd file was my last resort. And as mentioned, I don't
expect to have to do that much fiddling with a (relatively old) 19.05-5
version from the package manager. It seems "snap" provides a more
current version, 20.02.1:


snap install slurm         # version 20.02.1, or
apt  install slurm-client  # version 19.05.5-1


I also haven't modified the underlying distribution installation. I
want to use Ubuntu 20.04 as my future cluster OS, and the
KVM-virtualized SLURM controller was the first machine I tried.


Brian Andrus suggested:

On 17.03.21 21:32, Brian Andrus wrote:
> That is looking like your /run folder does not have world execute
> permissions, making it impossible for anything to access sub-directories.

But I can write as user "sven" (I didn't set up the LDAP connection,
yet) in a subdirectory of /run/slurm-lnll, if it belongs to user "sven".


Furthermore, I used the option "SlurmUser=slurm" in my slurm.conf file,
because it is good practice to not use root. Changing this to "root",
which should give universal access to all directories, doesn't make a
difference:

#SlurmUser=slurm
SlurmdUser=root


My initial response, that /var/run/slurm-lnll/slurmctld.pid worked for me,
was also premature. It only kind of works for the first start after a reboot:

systemctl start slurmctld

and

systemctl stop slurmctld

works, but then lingers until the timeout. During that time slurmctld
is still running: I can see the process and use squeue, sinfo, etc.

After the pid-file timeout it shows the service as terminated. This
time not due to an inability to write the slurmctld.pid file, but
instead complaining about my use of the legacy location /var/run -
which itself is only a symlink to /run:

Mar 18 12:30:43 slurm systemd[1]: Reloading.
Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/dbus.socket:5:
ListenStream= references a path below legacy directory /var/run/,
updating /var/>
Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/slurmd.service:12:
PIDFile= references a path below legacy directory /var/run/, updating
/var/r>
Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: start operation
timed out. Terminating.
Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: Failed with result
'timeout'.


time systemctl start slurmctld
Job for slurmctld.service failed because a timeout was exceeded.
See "systemctl status slurmctld.service" and "journalctl -xe" for details.

real    1m1.314s
user    0m0.003s
sys    0m0.002s



Initially I had put an "&" after the systemctl command because I wanted
to get back to my prompt to investigate the problem. Normal behaviour,
as I would expect it, is a start time of 1-2 seconds.


I am back to my work-around:

systemctl start slurmctld & sleep 10; echo `pgrep slurmctld` >
/run/slurm-lnll/slurmctld.pid && chown slurm:
/run/slurm-lnll/slurmctld.pid && cat /run/slurm-lnll/slurmctld.pid
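A less manual alternative (a sketch, untested here, and assuming /run/slurm-llnl is really the directory the configuration intends) would be a systemd drop-in that points PIDFile= at the same path slurm.conf uses, rather than editing the packaged unit file; drop-ins also survive package upgrades:

```ini
# /etc/systemd/system/slurmctld.service.d/pidfile.conf
# (apply with: systemctl daemon-reload && systemctl restart slurmctld)
[Service]
PIDFile=/run/slurm-llnl/slurmctld.pid
```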


My configuration file is read, though, as I can check with scontrol:

scontrol show config | grep run
SlurmdPidFile           = /var/run/slurm-llnl/slurmd.pid
SlurmctldPidFile        = /var/run/slurm-llnl/slurmctld.pid


So, none of this hassle should occur; my fiddling with systemd should
be entirely unnecessary.

Mar 18 12:37:13 slurm systemd[1]: slurmctld.service: Can't open PID file
/run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
Mar 18 12:38:43 slurm systemd[1]: slurmctld.service: start operation
timed out. Terminating.
Mar 18 12:38:43 slurm systemd[1]: slurmctld.service: Failed with result
'timeout'.


Unmodified systemd file:

[Unit]
Description=Slurm controller daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm-llnl/slurm.conf
Documentation=man:slurmctld(8)

[Service]
Type=forking
EnvironmentFile=-/etc/default/slurmctld
ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/run/slurm-lnll/slurmctld.pid
LimitNOFILE=65536
TasksMax=infinity

[Install]
WantedBy=multi-user.target


I do know of some file-permission issues I encountered on CentOS 7, but
by all appearances, i.e. checking the permissions, it should work with
this mode on the subdirectory:

ls -lthrd /run/slurm-lnll/
drwxrwxr-x 2 root slurm 40 Mar 18 12:31 /run/slurm-lnll/
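For reference, the mode shown above (drwxrwxr-x = 0775) does allow the owner and group "slurm" to create files there; only "other" users are blocked from writing. A quick scratch-directory analogue (no root required; the temp dir stands in for the real /run/slurm-lnll):

```shell
# Recreate a 0775 directory and confirm the octal mode and that a
# pid file can be created in it by the owner.
dir=$(mktemp -d)
chmod 775 "$dir"
stat -c '%a' "$dir"                        # prints: 775
touch "$dir/slurmctld.pid" && echo "write ok"
rm -rf "$dir"
```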


But this suggests that it ignores the settings in the slurm.conf file:

SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
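One thing that may be worth double-checking, since both spellings appear in this message: the slurm.conf entries use "slurm-llnl", while the unit file, the workaround, and the journal messages show "slurm-lnll". Because /var/run is only a symlink to /run, the legacy prefix has to be stripped before the two paths can be compared; a small sketch, with both paths copied from above:

```shell
# Both path spellings as they appear in this report:
conf_pid="/var/run/slurm-llnl/slurmctld.pid"   # from slurm.conf / scontrol
unit_pid="/run/slurm-lnll/slurmctld.pid"       # from the systemd unit file

# /var/run is only a symlink to /run, so strip the legacy prefix first:
norm_conf="${conf_pid#/var}"
norm_unit="${unit_pid#/var}"

if [ "$norm_conf" != "$norm_unit" ]; then
    echo "mismatch: slurmctld writes $norm_conf, systemd waits for $norm_unit"
fi
```

If the two normalized paths differ, slurmctld would write its pid file where slurm.conf says, while systemd keeps waiting on a path that never appears, which would time out exactly as shown in the journal.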


Mar 18 12:41:34 slurm systemd[1]: slurmctld.service: Can't open PID file
/run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
Mar 18 12:43:04 slurm systemd[1]: slurmctld.service: start operation
timed out. Terminating.
Mar 18 12:43:04 slurm systemd[1]: slurmctld.service: Failed with result
'timeout'.


Though scontrol show config claims otherwise:

 scontrol show config | grep run
SlurmdPidFile           = /var/run/slurm-llnl/slurmd.pid
SlurmctldPidFile        = /var/run/slurm-llnl/slurmctld.pid
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)


I would attribute it to my own fault, but I started yesterday with a
"vanilla" installation of Ubuntu 20.04 server, and the purpose of this
VM is only to run slurmctld.


This "should" affect many more people, or I am missing something
obvious. If it were due to permissions, making the directory
/run/slurm-lnll world-writable:

 ls -lthrd /run/slurm-lnll/
drwxrwxrwx 2 root slurm 40 Mar 18 12:31 /run/slurm-lnll/

should "fix" the problem. I could live with that, even though I try to
adhere to strict permission management.

But that also doesn't work:

Mar 18 12:46:33 slurm systemd[1]: slurmctld.service: Can't open PID file
/run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
Mar 18 12:46:38 slurm systemd[1]: Reloading.


So, I am turning in circles here.


Best wishes,

Sven


-- 
Sven Duscha
Deutsches Herzzentrum München
Technische Universität München
Lazarettstraße 36
80636 München
+49 89 1218 2602



