<div dir="ltr"><div dir="ltr"><div>After installing SLURM on Ubuntu and before starting the services, I run:</div><br>mkdir -p /var/spool/slurmd<br>mkdir -p /var/lib/slurm-llnl<br>mkdir -p /var/lib/slurm-llnl/slurmd<br>mkdir -p /var/lib/slurm-llnl/slurmctld<br>mkdir -p /var/run/slurm-llnl (change this to /run/slurm-llnl, the location of your SlurmdPidFile and SlurmctldPidFile)<br>mkdir -p /var/log/slurm-llnl<br><br>chmod -R 755 /var/spool/slurmd<br>chmod -R 755 /var/lib/slurm-llnl/<br>chmod -R 755 /var/run/slurm-llnl/ (same change here)<br>chmod -R 755 /var/log/slurm-llnl/<br><br>chown -R slurm:slurm /var/spool/slurmd<br>chown -R slurm:slurm /var/lib/slurm-llnl/<br>chown -R slurm:slurm /var/run/slurm-llnl/ (and here)<br>chown -R slurm:slurm /var/log/slurm-llnl/</div><div dir="ltr"><br></div><div>I hope that helps. My first SLURM installations failed because of missing directories and wrong permissions.</div><div><br></div><div>Best!<br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 17 Mar 2021 at 11:56, Brian Andrus (<<a href="mailto:toomuchit@gmail.com">toomuchit@gmail.com</a>>) wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I am guessing you aren't overly familiar with Linux/systemd since you <br>
have the '&' at the end of your start command.<br>
<br>
Be that as it may, you can see it is a permissions issue. Check <br>
permissions on /run and ensure the slurmctld user is able to write there.<br>
<br>
You can either change the slurmctld user to one that can write there or <br>
change the permissions on the directory to allow the slurmctld user <br>
write access.<br>
<br>
Brian Andrus<br>
<br>
<br>
On 3/17/2021 11:16 AM, Sven Duscha wrote:<br>
> Hi,<br>
><br>
> I get an error from SLURM's slurmctld on Ubuntu 20.04 when starting<br>
> the service (through systemctl):<br>
><br>
><br>
> I installed munge and SLURM version 19.05.5-1 through the package<br>
> manager from<br>
> the default repository:<br>
><br>
> apt-get install munge slurm-client slurm-wlm slurm-wlm-doc slurmctld slurmd<br>
><br>
><br>
> systemctl start slurmctld &<br>
> [1] 2735<br>
> 18:55 [root@slurm ~]# systemctl status slurmctld<br>
> ● slurmctld.service - Slurm controller daemon<br>
> Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;<br>
> vendor preset: enabled)<br>
> Active: activating (start) since Wed 2021-03-17 18:55:49 CET; 5s ago<br>
> Docs: man:slurmctld(8)<br>
> Process: 2737 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS<br>
> (code=exited, status=0/SUCCESS)<br>
> Tasks: 12<br>
> Memory: 2.5M<br>
> CGroup: /system.slice/slurmctld.service<br>
> └─2759 /usr/sbin/slurmctld<br>
><br>
> Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...<br>
> Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file<br>
> /run/slurmctld.pid (yet?) after start: Operation not permitted<br>
><br>
><br>
><br>
><br>
> After about 90 seconds (see the timestamps below), systemd terminates slurmctld:<br>
><br>
><br>
> -- A stop job for unit slurmctld.service has finished.<br>
> --<br>
> -- The job identifier is 1043 and the job result is done.<br>
> Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...<br>
> -- Subject: A start job for unit slurmctld.service has begun execution<br>
> -- Defined-By: systemd<br>
> -- Support: <a href="http://www.ubuntu.com/support" rel="noreferrer" target="_blank">http://www.ubuntu.com/support</a><br>
> --<br>
> -- A start job for unit slurmctld.service has begun execution.<br>
> --<br>
> -- The job identifier is 1044.<br>
> Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file<br>
> /run/slurmctld.pid (yet?) after start: Operation not permitted<br>
> Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: start operation<br>
> timed out. Terminating.<br>
> Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: Failed with result<br>
> 'timeout'.<br>
><br>
><br>
><br>
><br>
> My slurm.conf file lists custom PID file locations for slurmctld and slurmd:<br>
> /etc/slurm-llnl/slurm.conf<br>
><br>
> SlurmctldPidFile=/run/slurm-llnl/slurmctld.pid<br>
> SlurmdPidFile=/run/slurm-llnl/slurmd.pid<br>
><br>
><br>
><br>
> Starting the slurmctld executable by hand works fine:<br>
> /usr/sbin/slurmctld &<br>
><br>
> pgrep slurmctld<br>
> 2819<br>
> [1]+ Done /usr/sbin/slurmctld<br>
> pgrep slurmctld<br>
> 2819<br>
> squeue<br>
> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)<br>
> sinfo -lNe<br>
> Wed Mar 17 19:01:45 2021<br>
> NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK<br>
> WEIGHT AVAIL_FE REASON<br>
> ekgen1 1 cluster* unknown* 16 2:8:1 480000<br>
> 0 1 (null) none<br>
> ekgen2 1 cluster* down* 16 2:8:1 250000<br>
> 0 1 (null) Not responding<br>
> ekgen3 1 debian unknown* 16 2:8:1 250000<br>
> 0 1 (null) none<br>
> ekgen4 1 cluster* unknown* 16 2:8:1 250000<br>
> 0 1 (null) none<br>
> ekgen5 1 cluster* unknown* 16 2:8:1 250000<br>
> 0 1 (null) none<br>
> ekgen6 1 debian unknown* 16 2:8:1 250000<br>
> 0 1 (null) none<br>
> ekgen7 1 cluster* unknown* 16 2:8:1 250000<br>
> 0 1 (null) none<br>
> ekgen8 1 debian down* 16 2:8:1 250000<br>
> 0 1 (null) Not responding<br>
> ekgen9 1 cluster* unknown* 16 2:8:1 192000<br>
> 0 1 (null) none<br>
><br>
><br>
><br>
> I tried then to modify /lib/systemd/system/slurmd.service<br>
><br>
> cp /lib/systemd/system/slurmd.service<br>
> /lib/systemd/system/slurmd.service.orig<br>
><br>
> changed<br>
> PIDFile=/run/slurmd.pid<br>
> to<br>
> PIDFile=/run/slurm-llnl/slurmd.pid<br>
><br>
> systemctl start slurmctld &<br>
> [1] 1869<br>
> pgrep slurm<br>
> 1875<br>
> squeue<br>
> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)<br>
><br>
> After about 90 seconds:<br>
><br>
> Job for slurmctld.service failed because a timeout was exceeded.<br>
> See "systemctl status slurmctld.service" and "journalctl -xe" for details<br>
><br>
><br>
> - Subject: A start job for unit packagekit.service has finished successfully<br>
> -- Defined-By: systemd<br>
> -- Support: <a href="http://www.ubuntu.com/support" rel="noreferrer" target="_blank">http://www.ubuntu.com/support</a><br>
> --<br>
> -- A start job for unit packagekit.service has finished successfully.<br>
> --<br>
> -- The job identifier is 586.<br>
> Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: start operation<br>
> timed out. Terminating.<br>
> Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: Failed with result<br>
> 'timeout'.<br>
> -- Subject: Unit failed<br>
> -- Defined-By: systemd<br>
> -- Support: <a href="http://www.ubuntu.com/support" rel="noreferrer" target="_blank">http://www.ubuntu.com/support</a><br>
> --<br>
> -- The unit slurmctld.service has entered the 'failed' state with result<br>
> 'timeout'.<br>
> Mar 17 18:28:08 slurm systemd[1]: Failed to start Slurm controller daemon.<br>
> -- Subject: A start job for unit slurmctld.service has failed<br>
> -- Defined-By: systemd<br>
> -- Support: <a href="http://www.ubuntu.com/support" rel="noreferrer" target="_blank">http://www.ubuntu.com/support</a><br>
> --<br>
> -- A start job for unit slurmctld.service has finished with a failure.<br>
> --<br>
> -- The job identifier is 511 and the job result is failed.<br>
> Mar 17 18:31:18 slurm systemd[1]: Starting Slurm controller daemon...<br>
> -- Subject: A start job for unit slurmctld.service has begun execution<br>
> -- Defined-By: systemd<br>
> -- Support: <a href="http://www.ubuntu.com/support" rel="noreferrer" target="_blank">http://www.ubuntu.com/support</a><br>
> --<br>
> -- A start job for unit slurmctld.service has begun execution.<br>
> --<br>
> -- The job identifier is 662.<br>
> Mar 17 18:31:18 slurm systemd[1]: slurmctld.service: Can't open PID file<br>
> /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted<br>
> Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: start operation<br>
> timed out. Terminating.<br>
> Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: Failed with result<br>
> 'timeout'.<br>
> -- Subject: Unit failed<br>
> -- Defined-By: systemd<br>
> -- Support: <a href="http://www.ubuntu.com/support" rel="noreferrer" target="_blank">http://www.ubuntu.com/support</a><br>
><br>
><br>
><br>
> mkdir /run/slurm-lnll/<br>
> chown slurm: /run/slurm-lnll/<br>
><br>
> ls -lthrd /run/slurm-lnll/<br>
> drwxr-xr-x 2 slurm slurm 40 Mar 17 18:34 /run/slurm-lnll/<br>
><br>
> It doesn't create the PID file<br>
><br>
> ls -lthr /run/slurm-lnll/<br>
> total 0<br>
><br>
><br>
> A work-around, writing the PID manually to the PID file, does work:<br>
><br>
> systemctl start slurmctld & sleep 10; echo `pgrep slurmctld` ><br>
> /run/slurm-lnll/slurmctld.pid && chown slurm:<br>
> /run/slurm-lnll/slurmctld.pid && cat /run/slurm-lnll/slurmctld.pid<br>
><br>
><br>
> The status output still reports the problem:<br>
><br>
> systemctl status slurmctld<br>
> ● slurmctld.service - Slurm controller daemon<br>
> Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;<br>
> vendor preset: enabled)<br>
> Active: active (running) since Wed 2021-03-17 18:37:28 CET; 1min 4s ago<br>
> Docs: man:slurmctld(8)<br>
> Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS<br>
> (code=exited, status=0/SUCCESS)<br>
> Main PID: 2287 (slurmctld)<br>
> Tasks: 7<br>
> Memory: 2.3M<br>
> CGroup: /system.slice/slurmctld.service<br>
> └─2287 /usr/sbin/slurmctld<br>
><br>
> Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...<br>
> Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file<br>
> /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted<br>
> Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.<br>
><br>
><br>
> But the slurmctld process doesn't crash anymore. Stopping the service<br>
> does work:<br>
><br>
><br>
> systemctl stop slurmctld.service<br>
> systemctl status slurmctld<br>
> ● slurmctld.service - Slurm controller daemon<br>
> Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;<br>
> vendor preset: enabled)<br>
> Active: inactive (dead) since Wed 2021-03-17 18:50:47 CET; 1s ago<br>
> Docs: man:slurmctld(8)<br>
> Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS<br>
> (code=exited, status=0/SUCCESS)<br>
> Main PID: 2287 (code=exited, status=0/SUCCESS)<br>
><br>
> Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...<br>
> Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file<br>
> /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted<br>
> Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.<br>
> Mar 17 18:50:47 slurm systemd[1]: Stopping Slurm controller daemon...<br>
> Mar 17 18:50:47 slurm systemd[1]: slurmctld.service: Succeeded.<br>
> Mar 17 18:50:47 slurm systemd[1]: Stopped Slurm controller daemon.<br>
><br>
><br>
><br>
> I am a little astonished that the default package shows this strange<br>
> behaviour regarding slurmctld installed through the package manager.<br>
><br>
> The base installation is Ubuntu 20.04 server installation, where I did<br>
> no modifications apart from installing the SLURM-wlm packages and<br>
> importing my existing configuration and munge.key.<br>
><br>
><br>
> Best wishes,<br>
><br>
> Sven Duscha<br>
><br>
><br>
<br>
</blockquote></div></div>
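<div>The logs in this thread show systemd waiting for one PID file path while slurm.conf names another, so a quick comparison of the two settings may help narrow things down. The following is only a minimal sketch: the function names conf_value and check_pidfiles are made up for illustration, the default paths are the ones quoted in the thread, and the key match is case-sensitive and assumes plain KEY=value lines without trailing comments.</div>

```shell
#!/bin/sh
# Sketch: compare the PID file systemd waits for (PIDFile= in the unit file)
# with the one slurmctld is told to write (SlurmctldPidFile= in slurm.conf).
# conf_value and check_pidfiles are hypothetical helper names.

# Print the value of the last "KEY=value" line for KEY ($1) in FILE ($2).
conf_value() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" | tail -n 1
}

# Returns 0 if both paths agree, 1 on a mismatch, 2 if a setting is missing.
check_pidfiles() {
    unit_file=${1:-/lib/systemd/system/slurmctld.service}
    slurm_conf=${2:-/etc/slurm-llnl/slurm.conf}
    unit_pid=$(conf_value PIDFile "$unit_file")
    conf_pid=$(conf_value SlurmctldPidFile "$slurm_conf")
    if [ -z "$unit_pid" ] || [ -z "$conf_pid" ]; then
        echo "missing: unit=$unit_pid slurm.conf=$conf_pid"
        return 2
    fi
    if [ "$unit_pid" = "$conf_pid" ]; then
        echo "match: $unit_pid"
    else
        echo "mismatch: unit=$unit_pid slurm.conf=$conf_pid"
        return 1
    fi
}
```

<div>After sourcing the snippet, calling check_pidfiles with no arguments checks the default paths; pass a unit file and a slurm.conf as the first and second argument to check other locations. If the two paths differ, either setting can be changed so that they agree, and the directory they point to must be writable by the slurm user.</div>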