<div dir="ltr"><div class="gmail_default" style="font-family:monospace">I would also encourage you to use the defaults in slurm.conf (matching what's shipped in the Ubuntu packages). However, here is what I've done to use non-Ubuntu-package paths for the PID files.</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">Create an override in /etc/systemd/system/slurmd.service.d/override.conf with something like:</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">node32[~]: cat /etc/systemd/system/slurmd.service.d/override.conf <br>[Service]<br>PIDFile=/var/run/slurm-llnl/slurmd.pid<br></div><div class="gmail_default" style="font-family:monospace">RuntimeDirectory=slurm-llnl<br>RuntimeDirectoryMode=0775<br></div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">Replace the daemon name as necessary. The RuntimeDirectory= setting is needed because /run and /var/run are virtual file systems managed by systemd. Creating that directory "by hand" has unpredictable results.</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">HTH</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace"> - Michael</div><div class="gmail_default" style="font-family:monospace"><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Mar 18, 2021 at 4:52 AM Sven Duscha <<a href="mailto:sven.duscha@tum.de">sven.duscha@tum.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>
<br>
thanks for all the responses.<br>
<br>
On 18.03.21 11:29, Stefan Staeglich wrote:<br>
> I think it makes more sense to adjust the config file /etc/slurm-llnl/slurm.conf <br>
> and not the systemd units:<br>
> SlurmctldPidFile=/run/slurmctld.pid<br>
> SlurmdPidFile=/run/slurmd.pid<br>
<br>
<br>
That was of course my first approach. I had used the directory<br>
/run/slurm-lnll/ on my CentOS 7 installations, from which I copied the<br>
slurm.conf file over.<br>
<br>
It turned out that the directories I defined there weren't used. The<br>
error message suggested that slurmctld still tried to write to<br>
/run/slurmctld.pid.<br>
<br>
Changing the systemd file was my last resort. And as mentioned, I don't<br>
expect to have to do that much fiddling with a package-manager version<br>
(the relatively old 19.05.5). It seems "snap" provides a more current<br>
version, 20.02.1:<br>
<br>
<br>
snap install slurm # version 20.02.1, or<br>
apt install slurm-client # version 19.05.5-1<br>
<br>
<br>
I haven't modified the underlying distribution installation either. I<br>
want to use Ubuntu 20.04 as my future cluster OS, and the<br>
KVM-virtualized SLURM controller was the first thing I tried.<br>
<br>
<br>
Brian Andrus suggested:<br>
<br>
On 17.03.21 21:32, Brian Andrus wrote:<br>
> That is looking like your /run folder does not have world execute<br>
> permissions, making it impossible for anything to access sub-directories.<br>
<br>
But as user "sven" (I haven't set up the LDAP connection yet) I can<br>
write in a subdirectory of /run/slurm-lnll if it belongs to user "sven".<br>
<br>
<br>
Furthermore, I used the option "SlurmUser=slurm" in my slurm.conf file,<br>
because it is good practice to not use root. Changing this to "root",<br>
which should give universal access to all directories, doesn't make a<br>
difference:<br>
<br>
#SlurmUser=slurm<br>
SlurmdUser=root<br>
<br>
<br>
My initial response, that /var/run/slurm-lnll/slurmctld.pid worked for me,<br>
was also premature. It sort of works for the first start after a reboot with<br>
<br>
systemctl start slurmctld<br>
<br>
and<br>
<br>
systemctl stop slurmctld<br>
<br>
works, but then it lingers until the timeout. During that time<br>
slurmctld is still running: I can see the process and can use squeue, sinfo, etc.<br>
<br>
After the PID file timeout, the service is shown as terminated. This<br>
time the log does not complain about being unable to write the<br>
slurmctld.pid file; instead it points to my change to the legacy<br>
location /var/run, which itself only points to /run:<br>
<br>
Mar 18 12:30:43 slurm systemd[1]: Reloading.<br>
Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/dbus.socket:5:<br>
ListenStream= references a path below legacy directory /var/run/,<br>
updating /var/><br>
Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/slurmd.service:12:<br>
PIDFile= references a path below legacy directory /var/run/, updating<br>
/var/r><br>
Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: start operation<br>
timed out. Terminating.<br>
Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: Failed with result<br>
'timeout'.<br>
<br>
<br>
time systemctl start slurmctld<br>
Job for slurmctld.service failed because a timeout was exceeded.<br>
See "systemctl status slurmctld.service" and "journalctl -xe" for details.<br>
<br>
real 1m1.314s<br>
user 0m0.003s<br>
sys 0m0.002s<br>
<br>
-- A session with the ID 1 has been terminated.<br>
Mar 18 12:30:43 slurm systemd[1]: Reloading.<br>
Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/dbus.socket:5:<br>
ListenStream= references a path below legacy directory /var/run/,<br>
updating /var/><br>
Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/slurmd.service:12:<br>
PIDFile= references a path below legacy directory /var/run/, updating<br>
/var/r><br>
Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: start operation<br>
timed out. Terminating.<br>
Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: Failed with result<br>
'timeout'.<br>
<br>
<br>
I initially put an "&" after the systemctl command because I wanted to<br>
get my prompt back to investigate the problem. Normal behaviour, as I<br>
would expect it, is a start-up time of 1-2 seconds.<br>
<br>
<br>
I am back to my work-around:<br>
<br>
systemctl start slurmctld & sleep 10; echo `pgrep slurmctld` ><br>
/run/slurm-lnll/slurmctld.pid && chown slurm:<br>
/run/slurm-lnll/slurmctld.pid && cat /run/slurm-lnll/slurmctld.pid<br>
<br>
<br>
My configuration file is read, though, as I can check with scontrol:<br>
<br>
scontrol show config | grep run<br>
SlurmdPidFile = /var/run/slurm-llnl/slurmd.pid<br>
SlurmctldPidFile = /var/run/slurm-llnl/slurmctld.pid<br>
<br>
<br>
So all of this hassle shouldn't occur; my fiddling with systemd should<br>
be entirely unnecessary.<br>
<br>
Mar 18 12:37:13 slurm systemd[1]: slurmctld.service: Can't open PID file<br>
/run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted<br>
Mar 18 12:38:43 slurm systemd[1]: slurmctld.service: start operation<br>
timed out. Terminating.<br>
Mar 18 12:38:43 slurm systemd[1]: slurmctld.service: Failed with result<br>
'timeout'.<br>
<br>
<br>
Unmodified systemd file:<br>
<br>
[Unit]<br>
Description=Slurm controller daemon<br>
After=network.target munge.service<br>
ConditionPathExists=/etc/slurm-llnl/slurm.conf<br>
Documentation=man:slurmctld(8)<br>
<br>
[Service]<br>
Type=forking<br>
EnvironmentFile=-/etc/default/slurmctld<br>
ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS<br>
ExecReload=/bin/kill -HUP $MAINPID<br>
PIDFile=/run/slurm-lnll/slurmctld.pid<br>
LimitNOFILE=65536<br>
TasksMax=infinity<br>
<br>
[Install]<br>
WantedBy=multi-user.target<br>
~ <br>
<br>
<br>
I do know of some file permission issues that I encountered on CentOS 7,<br>
but as far as I can tell from checking the permissions, it should work<br>
with these permissions on the subdirectory:<br>
<br>
ls -lthrd /run/slurm-lnll/<br>
drwxrwxr-x 2 root slurm 40 Mar 18 12:31 /run/slurm-lnll/<br>
<br>
<br>
But this suggests that it ignores the settings in the slurm.conf file:<br>
<br>
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid<br>
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid<br>
<br>
<br>
-- The job identifier is 2259.<br>
Mar 18 12:41:34 slurm systemd[1]: slurmctld.service: Can't open PID file<br>
/run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted<br>
Mar 18 12:43:04 slurm systemd[1]: slurmctld.service: start operation<br>
timed out. Terminating.<br>
Mar 18 12:43:04 slurm systemd[1]: slurmctld.service: Failed with result<br>
'timeout'.<br>
<br>
<br>
Though scontrol show config claims otherwise:<br>
<br>
scontrol show config | grep run<br>
SlurmdPidFile = /var/run/slurm-llnl/slurmd.pid<br>
SlurmctldPidFile = /var/run/slurm-llnl/slurmctld.pid<br>
SrunEpilog = (null)<br>
SrunPortRange = 0-0<br>
SrunProlog = (null)<br>
<br>
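A quick way to see whether systemd and slurm.conf agree on the PID file location is to compare the two paths after folding the legacy /var/run prefix into /run. The following is only a minimal sketch: the function names are made up for illustration, it assumes a systemd recent enough to support "systemctl show --value", and it assumes slurmctld is running so that scontrol can answer:<br>
<br>
```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of Slurm or systemd).

# Rewrite a legacy /var/run/... path to its /run/... equivalent;
# any other path is printed unchanged.
normalize_run_path() {
    case "$1" in
        /var/run/*) printf '/run/%s\n' "${1#/var/run/}" ;;
        *)          printf '%s\n' "$1" ;;
    esac
}

# Succeed only if both paths name the same file after normalization.
pid_paths_match() {
    [ "$(normalize_run_path "$1")" = "$(normalize_run_path "$2")" ]
}

# Compare the PIDFile= that systemd uses for slurmctld with the
# SlurmctldPidFile reported by the running configuration.
check_slurmctld_pidfile() {
    unit_pid=$(systemctl show -p PIDFile --value slurmctld)
    conf_pid=$(scontrol show config | awk '$1 == "SlurmctldPidFile" { print $3 }')
    if pid_paths_match "$unit_pid" "$conf_pid"; then
        echo "match: $unit_pid"
    else
        echo "MISMATCH: systemd=$unit_pid slurm.conf=$conf_pid"
    fi
}
```
<br>
Calling check_slurmctld_pidfile as root prints either the agreed path or both differing paths. With the paths exactly as quoted in this message, pid_paths_match would report a mismatch, because the unit file spells the directory slurm-lnll while slurm.conf spells it slurm-llnl.<br>
<br>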
<br>
I would put it down to my own mistake, but I started yesterday with a<br>
"vanilla" installation of Ubuntu 20.04 server, and the only purpose of<br>
this VM is to run slurmctld.<br>
<br>
<br>
This "should" happen to many more people, or I am missing something<br>
obvious. If it were down to permissions, making the directory<br>
/run/slurm-lnll world-writable:<br>
<br>
ls -lthrd /run/slurm-lnll/<br>
drwxrwxrwx 2 root slurm 40 Mar 18 12:31 /run/slurm-lnll/<br>
<br>
should "fix" the problem. I could live with that, even though I try to<br>
adhere to strict permission management.<br>
<br>
That doesn't work either:<br>
<br>
Mar 18 12:46:33 slurm systemd[1]: slurmctld.service: Can't open PID file<br>
/run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted<br>
Mar 18 12:46:38 slurm systemd[1]: Reloading.<br>
<br>
<br>
So I am going round in circles here.<br>
<br>
<br>
Best wishes,<br>
<br>
Sven<br>
<br>
<br>
-- <br>
Sven Duscha<br>
Deutsches Herzzentrum München<br>
Technische Universität München<br>
Lazarettstraße 36<br>
80636 München<br>
+49 89 1218 2602<br>
<br>
<br>
</blockquote></div>