<div dir="auto"><div>scontrol update nodename=name-of-node state=resume</div><div dir="auto"><br></div><div dir="auto"><br><br><div class="gmail_quote" dir="auto"><div dir="ltr" class="gmail_attr">On Tue, Nov 16, 2021, 1:36 PM Jaep Emmanuel <<a href="mailto:emmanuel.jaep@epfl.ch">emmanuel.jaep@epfl.ch</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">





<div lang="en-CH" link="blue" vlink="purple" style="word-wrap:break-word">
<div class="m_-8756868301862236933WordSection1">
<p class="MsoNormal"><span lang="EN-US">How do you do that?<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">As per documentation, the resume command applies to the job list (<a href="https://slurm.schedmd.com/scontrol.html" target="_blank" rel="noreferrer">https://slurm.schedmd.com/scontrol.html</a>), not to the node.<u></u><u></u></span></p>
<p class="MsoNormal"><span><u></u> <u></u></span></p>
<div style="border:none;border-top:solid #b5c4df 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span style="font-size:12.0pt;color:black">From: </span></b><span style="font-size:12.0pt;color:black">slurm-users <<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank" rel="noreferrer">slurm-users-bounces@lists.schedmd.com</a>> on behalf of Stephen Cousins <<a href="mailto:steve.cousins@maine.edu" target="_blank" rel="noreferrer">steve.cousins@maine.edu</a>><br>
<b>Reply to: </b>Slurm User Community List <<a href="mailto:slurm-users@lists.schedmd.com" target="_blank" rel="noreferrer">slurm-users@lists.schedmd.com</a>><br>
<b>Date: </b>Tuesday, 16 November 2021 at 19:09<br>
<b>To: </b>Slurm User Community List <<a href="mailto:slurm-users@lists.schedmd.com" target="_blank" rel="noreferrer">slurm-users@lists.schedmd.com</a>><br>
<b>Subject: </b>Re: [slurm-users] Unable to start slurmd service<u></u><u></u></span></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">I think you just need to use scontrol to "resume" that node. <u></u><u></u></p>
</div>
<p class="MsoNormal"><u></u> <u></u></p>
<div>
<div>
<p class="MsoNormal">On Tue, Nov 16, 2021, 10:10 AM Jaep Emmanuel <<a href="mailto:emmanuel.jaep@epfl.ch" target="_blank" rel="noreferrer">emmanuel.jaep@epfl.ch</a>> wrote:<u></u><u></u></p>
</div>
<blockquote style="border:none;border-left:solid #cccccc 1.0pt;padding:0cm 0cm 0cm 6.0pt;margin-left:4.8pt;margin-right:0cm">
<div>
<div>
<p class="MsoNormal"><span lang="FR-CH">Hi,</span><u></u><u></u></p>
<p class="MsoNormal"><span lang="FR-CH"> </span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US">It might be a newbie question since I'm new to Slurm.</span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US">I'm trying to restart the slurmd service on one of our Ubuntu boxes.</span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US"> </span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US">The slurmd.service is defined by:</span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US"> </span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">[Unit]</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">Description=Slurm node daemon</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">After=network.target munge.service</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">ConditionPathExists=/etc/slurm/slurm.conf</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US"> </span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">[Service]</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">Type=forking</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">EnvironmentFile=-/etc/sysconfig/slurmd</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">ExecReload=/bin/kill -HUP $MAINPID</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">PIDFile=/var/run/slurmd.pid</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">KillMode=process</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">LimitNOFILE=51200</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">LimitMEMLOCK=infinity</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">LimitSTACK=infinity</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US"> </span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">[Install]</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">WantedBy=multi-user.target</span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US"> </span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US"> </span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US">The service starts without issue (systemctl start slurmd.service).</span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US">However, when checking the status of the service, I get a couple of error messages, but nothing alarming:</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US"> </span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">~# systemctl status slurmd.service</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">● slurmd.service - Slurm node daemon</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">     Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">     Active: active (running) since Tue 2021-11-16 15:58:01 CET; 50s ago</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">    Process: 2713019 ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">   Main PID: 2713021 (slurmd)</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">      Tasks: 1 (limit: 134845)</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">     Memory: 1.9M</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">     CGroup: /system.slice/slurmd.service</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">             └─2713021 /usr/sbin/slurmd -d /usr/sbin/slurmstepd</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US"> </span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">Nov 16 15:58:01 ecpsc10 systemd[1]: Starting Slurm node daemon...</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">Nov 16 15:58:01 ecpsc10 systemd[1]: slurmd.service: Can't open PID file /run/slurmd.pid (yet?) after start: Operation not pe></span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">Nov 16 15:58:01 ecpsc10 systemd[1]: Started Slurm node daemon.</span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US"> </span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US">Unfortunately, the node is still seen as down when I issue 'sinfo':</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">root@ecpsc10:~# sinfo</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">Compute         up   infinite      2   idle ecpsc[11-12]</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">→ </span><span lang="EN-US">Compute         up   infinite      1   down ecpsc10</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">FastCompute*    up   infinite      2   idle ecpsf[10-11]</span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US"> </span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US">When I query this node, I get the following details:</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">root@ecpsc10:~# scontrol show node ecpsc10</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">NodeName=ecpsc10 Arch=x86_64 CoresPerSocket=8</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">   CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=0.00</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">   AvailableFeatures=(null)</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">   ActiveFeatures=(null)</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">   Gres=(null)</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">   NodeAddr=ecpsc10 NodeHostName=ecpsc10 Version=17.11</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">   OS=Linux 5.8.0-43-generic #49~20.04.1-Ubuntu SMP Fri Feb 5 09:57:56 UTC 2021</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">   RealMemory=40195 AllocMem=0 FreeMem=4585 Sockets=2 Boards=1</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">   </span><span lang="FR-CH">Partitions=Compute</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="FR-CH">   BootTime=2021-10-25T14:16:35 SlurmdStartTime=2021-11-16T15:58:01</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="FR-CH">   </span><span lang="EN-US">CfgTRES=cpu=16,mem=40195M,billing=16</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">   AllocTRES=</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">   CapWatts=n/a</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">   Reason=Node unexpectedly rebooted [slurm@2021-11-16T14:41:04]</span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US"> </span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US"> </span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US">From the reason, I gather that the node is not returned to service because the machine was rebooted.</span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US">However, the /etc/slurm/slurm.conf looks like:</span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US"> </span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">root@ecpsc10:~# cat /etc/slurm/slurm.conf | grep -i returntoservice</span><u></u><u></u></p>
<p class="MsoNormal" style="margin-left:36.0pt">
<span lang="EN-US">ReturnToService=2</span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US"> </span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US"> </span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US">So I'm quite puzzled as to why the node will not come back online.</span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US"> </span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US">Any help will be greatly appreciated.</span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US"> </span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US">Best,</span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US"> </span><u></u><u></u></p>
<p class="MsoNormal"><span lang="EN-US">Emmanuel</span><u></u><u></u></p>
</div>
</div>
</blockquote>
</div>
</div>
</div>

</blockquote></div></div></div>
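<div dir="auto">For reference, a minimal sketch of the resume step suggested at the top of this thread, using the node name ecpsc10 from the original post. The helper name resume_node and the DRY_RUN variable are assumptions made for illustration and are not part of Slurm; by default the script only prints the command instead of running it. Note that "scontrol update NodeName=... State=RESUME" acts on a node record, whereas the separate "scontrol resume" subcommand acts on a list of suspended jobs.</div>

```shell
#!/bin/sh
# Sketch only: resume_node and DRY_RUN are hypothetical names, not Slurm features.
# Builds the node-resume command and prints it unless DRY_RUN=0 is set.
resume_node() {
    node="$1"
    # Node-level update; different from "scontrol resume job_list",
    # which resumes suspended jobs.
    set -- scontrol update "NodeName=$node" State=RESUME
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"        # really run it (needs Slurm administrator rights)
    else
        echo "$*"   # dry run: only show the command that would be executed
    fi
}

resume_node ecpsc10
```

<div dir="auto">Running it with DRY_RUN=0 as root or as the SlurmUser should execute the command; "scontrol show node ecpsc10" can then be used to check whether State is still DOWN.</div>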