<div dir="auto">I think you just need to use scontrol to "resume" that node.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Nov 16, 2021, 10:10 AM Jaep Emmanuel &lt;<a href="mailto:emmanuel.jaep@epfl.ch">emmanuel.jaep@epfl.ch</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">





<div lang="en-CH" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="m_-3440020188558432755WordSection1">
<p class="MsoNormal"><span lang="FR-CH">Hi,<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="FR-CH"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">This might be a newbie question, since I'm new to Slurm.<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">I'm trying to restart the slurmd service on one of our Ubuntu boxes.<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">The slurmd.service is defined by:<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">[Unit]<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">Description=Slurm node daemon<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">After=network.target munge.service<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">ConditionPathExists=/etc/slurm/slurm.conf<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">[Service]<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">Type=forking<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">EnvironmentFile=-/etc/sysconfig/slurmd<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">ExecReload=/bin/kill -HUP $MAINPID<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">PIDFile=/var/run/slurmd.pid<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">KillMode=process<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">LimitNOFILE=51200<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">LimitMEMLOCK=infinity<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">LimitSTACK=infinity<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">[Install]<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">WantedBy=multi-user.target<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">The service starts without issue (systemctl start slurmd.service).<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">However, when checking the status of the service, I get a couple of error messages, but nothing alarming:<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">~# systemctl status slurmd.service<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">● slurmd.service - Slurm node daemon<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">     Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">     Active: active (running) since Tue 2021-11-16 15:58:01 CET; 50s ago<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">    Process: 2713019 ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">   Main PID: 2713021 (slurmd)<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">      Tasks: 1 (limit: 134845)<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">     Memory: 1.9M<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">     CGroup: /system.slice/slurmd.service<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">             └─2713021 /usr/sbin/slurmd -d /usr/sbin/slurmstepd<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">Nov 16 15:58:01 ecpsc10 systemd[1]: Starting Slurm node daemon...<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">Nov 16 15:58:01 ecpsc10 systemd[1]: slurmd.service: Can't open PID file /run/slurmd.pid (yet?) after start: Operation not pe><u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">Nov 16 15:58:01 ecpsc10 systemd[1]: Started Slurm node daemon.<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">Unfortunately, the node is still seen as down when I issue 'sinfo':<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">root@ecpsc10:~# sinfo<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">Compute         up   infinite      2   idle ecpsc[11-12]<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">Compute         up   infinite      1   down ecpsc10<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">FastCompute*    up   infinite      2   idle ecpsf[10-11]<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">When I query this node, I get the following details:<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">root@ecpsc10:~# scontrol show node ecpsc10<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">NodeName=ecpsc10 Arch=x86_64 CoresPerSocket=8<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">   CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=0.00<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">   AvailableFeatures=(null)<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">   ActiveFeatures=(null)<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">   Gres=(null)<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">   NodeAddr=ecpsc10 NodeHostName=ecpsc10 Version=17.11<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">   OS=Linux 5.8.0-43-generic #49~20.04.1-Ubuntu SMP Fri Feb 5 09:57:56 UTC 2021<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">   RealMemory=40195 AllocMem=0 FreeMem=4585 Sockets=2 Boards=1<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">   </span><span lang="FR-CH">Partitions=Compute<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="FR-CH">   BootTime=2021-10-25T14:16:35 SlurmdStartTime=2021-11-16T15:58:01<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="FR-CH">   </span><span lang="EN-US">CfgTRES=cpu=16,mem=40195M,billing=16<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">   AllocTRES=<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">   CapWatts=n/a<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">   Reason=Node unexpectedly rebooted [slurm@2021-11-16T14:41:04]<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">From the reason, I gather that the node is not returning to service because the machine was rebooted.<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">However, /etc/slurm/slurm.conf contains:<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">root@ecpsc10:~# cat /etc/slurm/slurm.conf | grep -i returntoservice<u></u><u></u></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US">ReturnToService=2<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">So I'm quite puzzled as to why the node will not come back online.<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">Any help will be greatly appreciated.<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">Best,<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">Emmanuel<u></u><u></u></span></p>
</div>
</div>

</blockquote></div>
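<div dir="auto">For what it's worth, the "resume" suggestion at the top of this message might look something like the sketch below. The node name ecpsc10 comes from the quoted output. The helper names node_state and resume_node and the DRY_RUN variable are made up for illustration and are not part of Slurm. The update has to be run with Slurm administrator rights.</div>

```shell
#!/bin/sh
# Sketch only: node_state, resume_node and DRY_RUN are illustrative names,
# not Slurm features. The scontrol calls themselves are standard.

# Read `scontrol show node <name>` output on stdin and print the State= value.
node_state() {
    sed -n 's/^ *State=\([^ ]*\).*/\1/p' | head -n 1
}

# Ask slurmctld to return the node to service.
# With DRY_RUN=1 the command is printed instead of executed.
resume_node() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "scontrol update NodeName=$1 State=RESUME"
    else
        scontrol update NodeName="$1" State=RESUME
    fi
}
```

<div dir="auto">A possible way to use it: run <code>scontrol show node ecpsc10 | node_state</code> first, and if it still reports a DOWN state, call <code>resume_node ecpsc10</code> and then check <code>sinfo</code> again.</div>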