[slurm-users] SLURM slurmctld error on Ubuntu20.04 starting through systemctl

Michael Gutteridge michael.gutteridge at gmail.com
Thu Mar 18 21:55:21 UTC 2021


I would also encourage you to use defaults in the slurm.conf (matching
what's shipped in the Ubuntu packages).  However, here is what I've done to
use non-Ubuntu-package paths for the PID files.

Create an override in /etc/systemd/system/slurmd.service.d/override.conf
with something like:

node32[~]: cat /etc/systemd/system/slurmd.service.d/override.conf
[Service]
PIDFile=/var/run/slurm-llnl/slurmd.pid
RuntimeDirectory=slurm-llnl
RuntimeDirectoryMode=0775

Replace the daemon name as necessary.  The RuntimeDirectory= setting is needed
because /run is a temporary file system managed by systemd (/var/run is just a
symlink to it), so it starts out empty on every boot.  Creating that directory
"by hand" has unpredictable results.
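
For what it's worth, here is a minimal sketch of how the drop-in could be
generated for either daemon.  write_override and its two arguments are names
made up for this example, and the scratch directory is only there so the
result can be inspected before touching /etc:

```shell
#!/bin/sh
# Sketch: generate a PIDFile drop-in for a Slurm daemon unit.
# write_override is a hypothetical helper, not part of Slurm or systemd.
write_override() {
    daemon="$1"                              # e.g. slurmd or slurmctld
    unit_root="${2:-/etc/systemd/system}"    # overridable for a dry run
    dropin_dir="$unit_root/$daemon.service.d"
    mkdir -p "$dropin_dir" || return 1
    # Same three settings as in the override shown above.
    cat > "$dropin_dir/override.conf" <<EOF
[Service]
PIDFile=/var/run/slurm-llnl/$daemon.pid
RuntimeDirectory=slurm-llnl
RuntimeDirectoryMode=0775
EOF
}

# Dry run: write into a scratch directory and show the result.
scratch=$(mktemp -d)
write_override slurmctld "$scratch"
cat "$scratch/slurmctld.service.d/override.conf"
```

To apply it for real, run "write_override slurmctld" as root (the second
argument then defaults to /etc/systemd/system), followed by
"systemctl daemon-reload" and "systemctl restart slurmctld".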

HTH

 - Michael


On Thu, Mar 18, 2021 at 4:52 AM Sven Duscha <sven.duscha at tum.de> wrote:

> Hi,
>
> thanks for all the responses.
>
> On 18.03.21 11:29, Stefan Staeglich wrote:
> > I think it makes more sense to adjust the config file /etc/slurm-llnl/slurm.conf
> > and not the systemd units:
> > SlurmctldPidFile=/run/slurmctld.pid
> > SlurmdPidFile=/run/slurmd.pid
>
>
> That was of course my first approach. I had used the directory
> /run/slurm-lnll/ on my CentOS 7 installations, from which I copied the
> slurm.conf file over.
>
> It turned out that those directories I defined there weren't used. The
> error message suggested that slurmctld still tried to write to
> /run/slurmctld.pid.
>
> Changing the systemd file was my last resort. And as mentioned, I don't
> expect to have to do that much fiddling with a (relatively old, 19.05.5)
> version from the package manager.  It seems "snap" provides a more current
> version, 20.02.1:
>
>
> snap install slurm         # version 20.02.1, or
> apt  install slurm-client  # version 19.05.5-1
>
>
> The underlying distribution installation also hasn't been modified by
> me, I want to use Ubuntu20.04 as my future cluster OS, and the
> kvm-virtualized SLURM controller was the first I tried.
>
>
> Brian Andrus suggested:
>
> On 17.03.21 21:32, Brian Andrus wrote:
> > That is looking like your /run folder does not have world execute
> > permissions, making it impossible for anything to access sub-directories.
>
> But I can write as user "sven" (I haven't set up the LDAP connection
> yet) in a subdirectory of /run/slurm-lnll, if it belongs to user "sven".
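
One way to test the permission theory is to list the mode of every directory
on the way to the PID file.  A small sketch, where walk_perms is a made-up
helper name; "namei -l" from util-linux prints similar information in one go:

```shell
#!/bin/sh
# Sketch: print owner and mode of a path and of each of its parents.
# walk_perms is a hypothetical helper; it expects an absolute path.
walk_perms() {
    case "$1" in
        /*) ;;                 # absolute path, fine
        *)  return 2 ;;        # refuse relative paths
    esac
    p="$1"
    while :; do
        ls -ld "$p"            # a missing component just prints an error
        [ "$p" = "/" ] && break
        p=$(dirname "$p")
    done
}
```

Called as "walk_perms /run/slurm-llnl/slurmctld.pid", it shows at a glance
whether any directory in the chain lacks the execute (search) bit for the
user the daemon runs as.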
>
>
> Furthermore, I used the option "SlurmUser=slurm" in my slurm.conf file,
> because it is good practice to not use root. Changing this to "root",
> which should give universal access to all directories, doesn't make a
> difference:
>
> #SlurmUser=slurm
> SlurmdUser=root
>
>
> My initial response, that /var/run/slurm-lnll/slurmctld.pid worked for me,
> was also premature. It kind of works for the first start after a reboot
> with
>
> systemctl start slurmctld
>
> and
>
> systemctl stop slurmctld
>
> works, but then lingers around until the timeout. During that time
> slurmctld still runs, I see the process, and can use squeue, sinfo etc.
>
> After the PID file timeout, systemd shows the service as terminated.
> This time the log does not complain about being unable to write the
> slurmctld.pid file, but instead points at my change to the legacy
> location /var/run, which itself is only a symlink to /run:
>
> Mar 18 12:30:43 slurm systemd[1]: Reloading.
> Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/dbus.socket:5:
> ListenStream= references a path below legacy directory /var/run/,
> updating /var/>
> Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/slurmd.service:12:
> PIDFile= references a path below legacy directory /var/run/, updating
> /var/r>
> Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: start operation
> timed out. Terminating.
> Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: Failed with result
> 'timeout'.
>
>
> time systemctl start slurmctld
> Job for slurmctld.service failed because a timeout was exceeded.
> See "systemctl status slurmctld.service" and "journalctl -xe" for details.
>
> real    1m1.314s
> user    0m0.003s
> sys    0m0.002s
>
> -- A session with the ID 1 has been terminated.
>
>
> I put the "&" after systemctl in my initial command because I wanted to get
> back to my prompt to investigate the problem. Normal behaviour, as I would
> expect it, is a start-up time of 1-2 seconds.
>
>
> I am back to my work-around:
>
> systemctl start slurmctld & sleep 10; \
>   echo `pgrep slurmctld` > /run/slurm-lnll/slurmctld.pid && \
>   chown slurm: /run/slurm-lnll/slurmctld.pid && \
>   cat /run/slurm-lnll/slurmctld.pid
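
If the workaround is kept at all, polling for the process is a little less
fragile than a fixed "sleep 10".  A sketch only; wait_for_pid and its
arguments are names invented here, and it assumes a pgrep that supports
-o (oldest match) and -x (exact name), as the procps version does:

```shell
#!/bin/sh
# Sketch: wait until a process with the given name exists, then print its PID.
# wait_for_pid is a hypothetical helper.
wait_for_pid() {
    name="$1"                  # exact process name, e.g. slurmctld
    tries="${2:-10}"           # number of one-second attempts
    while [ "$tries" -gt 0 ]; do
        # -o picks the oldest match so only a single PID is printed.
        pid=$(pgrep -o -x "$name") && { printf '%s\n' "$pid"; return 0; }
        sleep 1
        tries=$((tries - 1))
    done
    return 1                   # process never showed up
}
```

The output of "wait_for_pid slurmctld" could then be redirected into the PID
file in place of the "sleep 10; echo `pgrep slurmctld`" part.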
>
>
> My configuration file is read, though, as I can check with scontrol:
>
> scontrol show config | grep run
> SlurmdPidFile           = /var/run/slurm-llnl/slurmd.pid
> SlurmctldPidFile        = /var/run/slurm-llnl/slurmctld.pid
>
>
> So all of this hassle shouldn't occur, and my fiddling with systemd should
> be entirely unnecessary.
>
> Mar 18 12:37:13 slurm systemd[1]: slurmctld.service: Can't open PID file
> /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> Mar 18 12:38:43 slurm systemd[1]: slurmctld.service: start operation
> timed out. Terminating.
> Mar 18 12:38:43 slurm systemd[1]: slurmctld.service: Failed with result
> 'timeout'.
>
>
> Unmodified systemd file:
>
> [Unit]
> Description=Slurm controller daemon
> After=network.target munge.service
> ConditionPathExists=/etc/slurm-llnl/slurm.conf
> Documentation=man:slurmctld(8)
>
> [Service]
> Type=forking
> EnvironmentFile=-/etc/default/slurmctld
> ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
> ExecReload=/bin/kill -HUP $MAINPID
> PIDFile=/run/slurm-lnll/slurmctld.pid
> LimitNOFILE=65536
> TasksMax=infinity
>
> [Install]
> WantedBy=multi-user.target
> ~
>
>
> I do know of some file permission issues that I encountered on CentOS 7,
> but as far as I can tell from checking the permissions, it should work
> with these permissions on the subdirectory:
>
> ls -lthrd /run/slurm-lnll/
> drwxrwxr-x 2 root slurm 40 Mar 18 12:31 /run/slurm-lnll/
>
>
> But this suggests, it ignores the setting in the slurm.conf file:
>
> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
>
>
> -- The job identifier is 2259.
> Mar 18 12:41:34 slurm systemd[1]: slurmctld.service: Can't open PID file
> /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> Mar 18 12:43:04 slurm systemd[1]: slurmctld.service: start operation
> timed out. Terminating.
> Mar 18 12:43:04 slurm systemd[1]: slurmctld.service: Failed with result
> 'timeout'.
>
>
> Though scontrol show config claims otherwise:
>
>  scontrol show config | grep run
> SlurmdPidFile           = /var/run/slurm-llnl/slurmd.pid
> SlurmctldPidFile        = /var/run/slurm-llnl/slurmctld.pid
> SrunEpilog              = (null)
> SrunPortRange           = 0-0
> SrunProlog              = (null)
>
>
> I would put it down to a mistake of my own, but I started yesterday with a
> "vanilla" installation of Ubuntu20.04 server, and the purpose of this VM
> is only to run slurmctld.
>
>
> This "should" occur to many more people, or I am missing something
> obvious. If it were down to permissions, making the directory /run/slurm-lnll
> world-writable:
>
>  ls -lthrd /run/slurm-lnll/
> drwxrwxrwx 2 root slurm 40 Mar 18 12:31 /run/slurm-lnll/
>
> should "fix" the problem. I could live with that, even though I try to
> adhere to strict permission management.
>
> That doesn't work either:
>
> Mar 18 12:46:33 slurm systemd[1]: slurmctld.service: Can't open PID file
> /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> Mar 18 12:46:38 slurm systemd[1]: Reloading.
>
>
> So I am going in circles here.
>
>
> Best wishes,
>
> Sven
>
>
> --
> Sven Duscha
> Deutsches Herzzentrum München
> Technische Universität München
> Lazarettstraße 36
> 80636 München
> +49 89 1218 2602
>
>
>