<div dir="ltr">I collected the log from slurmctld; it shows the following:<div><br></div><div>[2020-06-10T20:10:38.501] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-10T20:14:38.901] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-10T20:18:38.255] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-10T20:22:38.624] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-10T20:26:38.902] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-10T20:30:38.230] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-10T20:34:38.594] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-10T20:38:38.986] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-10T20:42:38.402] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-10T20:46:38.764] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-10T20:50:38.094] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-10T21:26:38.839] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-10T21:30:38.225] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-10T21:34:38.582] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-10T21:38:38.914] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-10T21:42:38.292] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-10T21:46:38.542] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-10T21:50:38.869] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-10T21:54:38.227] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-10T21:58:38.628] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-11T06:54:39.012] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-11T06:58:39.411] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-11T07:02:39.106] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-11T07:06:39.495] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-11T07:10:39.814] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-11T07:14:39.188] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3<br>
[2020-06-11T07:14:49.204] agent/is_node_resp: node:oled3 RPC:REQUEST_TERMINATE_JOB : Communication connection failure<br>
[2020-06-11T07:14:50.210] error: Nodes oled3 not responding<br>
[2020-06-11T07:15:54.313] error: Nodes oled3 not responding<br>
[2020-06-11T07:17:34.407] error: Nodes oled3 not responding<br>
[2020-06-11T07:19:14.637] error: Nodes oled3 not responding<br>
[2020-06-11T07:19:54.313] update_node: node oled3 reason set to: reboot-required<br>
[2020-06-11T07:19:54.313] update_node: node oled3 state set to DRAINING*<br>
[2020-06-11T07:20:43.788] requeue job 1316970 due to failure of node oled3<br>
[2020-06-11T07:20:43.788] requeue job 1349322 due to failure of node oled3<br>
[2020-06-11T07:20:43.789] error: Nodes oled3 not responding, setting DOWN<br></div><div><br></div><div>sinfo says:</div><div><br></div><div>OLED*      up  infinite    1 drain* oled3<br></div><div><br></div><div>When I check the node itself, it looks healthy to me.</div><div><br></div><div>Regards<br></div><div>Navin</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jun 11, 2020 at 7:21 PM Riebs, Andy &lt;<a href="mailto:andy.riebs@hpe.com">andy.riebs@hpe.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div lang="EN-GB">
<div class="gmail-m_-1283298587124638887WordSection1">
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">Weird. “slurmd -Dvvv” ought to report a whole lot of data; I can’t guess how to interpret it not reporting anything but the “log file”
and “munge” messages. When you have it running attached to your window, is there any chance that sinfo or scontrol suggest that the node is actually all right? Perhaps something in /etc/sysconfig/slurm or the like is messed up?<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)"><u></u>&nbsp;<u></u></span></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">If that’s not the case, I think my next step would be to follow up on someone else’s suggestion, and scan the slurmctld.log file for
the problem node name.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)"><u></u>&nbsp;<u></u></span></p>
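<p class="MsoNormal">The suggestion above, scanning slurmctld.log for the problem node name, could be sketched roughly as follows. This is only an illustration: the helper names (lines_for_node, summarize) are made up here, and the "[timestamp] message" layout is assumed from the log excerpt posted in this thread.</p>

```python
import re
from collections import Counter

# Matches the "[timestamp] message" layout seen in the slurmctld excerpt
# quoted in this thread (an assumption about the log format).
LINE_RE = re.compile(r"^\[([^\]]+)\]\s+(.*)$")


def lines_for_node(log_lines, node):
    """Yield (timestamp, message) for every log line that mentions `node`."""
    # Word-boundary match so that "oled3" does not also match "oled30".
    node_re = re.compile(r"\b" + re.escape(node) + r"\b")
    for line in log_lines:
        m = LINE_RE.match(line.strip())
        if m and node_re.search(m.group(2)):
            yield m.group(1), m.group(2)


def summarize(log_lines, node):
    """Count how often each distinct message (timestamp stripped) occurs."""
    return Counter(msg for _, msg in lines_for_node(log_lines, node))
```

<p class="MsoNormal">Passing an open file object for the slurmctld log (its path depends on the site configuration) and the node name to summarize() gives a per-message count, which makes repeated entries such as the TERMINATE_JOB resends easy to spot.</p>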
<p class="MsoNormal"><b><span lang="EN-US" style="font-size:11pt;font-family:Calibri,sans-serif">From:</span></b><span lang="EN-US" style="font-size:11pt;font-family:Calibri,sans-serif"> slurm-users [mailto:<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>]
<b>On Behalf Of </b>navin srivastava<br>
<b>Sent:</b> Thursday, June 11, 2020 9:26 AM<br>
<b>To:</b> Slurm User Community List &lt;<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a>&gt;<br>
<b>Subject:</b> Re: [slurm-users] unable to start slurmd process.<u></u><u></u></span></p>
<p class="MsoNormal"><u></u>&nbsp;<u></u></p>
<div>
<p class="MsoNormal">Sorry Andy, I forgot to add this.<u></u><u></u></p>
<div>
<p class="MsoNormal">First I tried slurmd -Dvvv, and it printed nothing beyond these two lines:<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">slurmd: debug:  Log file re-opened<br>
slurmd: debug:  Munge authentication plugin loaded<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u>&nbsp;<u></u></p>
</div>
<div>
<p class="MsoNormal">After that I waited 10-20 minutes with no further output, and finally pressed Ctrl-C.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u>&nbsp;<u></u></p>
</div>
<div>
<p class="MsoNormal">My suspicion is about these entries in the slurm.conf file:<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u>&nbsp;<u></u></p>
</div>
<div>
<p class="MsoNormal">ControlMachine=deda1x1466<br>
ControlAddr=192.168.150.253<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u>&nbsp;<u></u></p>
</div>
<div>
<p class="MsoNormal">The hostname deda1x1466 belongs to a different interface with a different IP address, which the compute node is unable to ping, but the ControlAddr IP is pingable.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Could this be one of the reasons?<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u>&nbsp;<u></u></p>
</div>
<div>
<p class="MsoNormal">But other nodes have the same config and I am able to start slurmd there, so I am a bit confused.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u>&nbsp;<u></u></p>
</div>
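<p class="MsoNormal">The ControlMachine versus ControlAddr question above can be checked from the compute node with a small script. The sketch below is an assumption-laden illustration, not part of Slurm: check_controller and resolve are hypothetical helper names, and the script only compares name resolution with the configured address. It does not show whether this is what makes slurmd hang.</p>

```python
import socket


def resolve(name, resolver=socket.gethostbyname):
    """Return the IPv4 address `name` resolves to, or None if lookup fails."""
    try:
        return resolver(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def check_controller(control_machine, control_addr,
                     resolver=socket.gethostbyname):
    """Compare what the hostname resolves to with the configured address."""
    resolved = resolve(control_machine, resolver)
    if resolved is None:
        return f"{control_machine} does not resolve on this node"
    if resolved != control_addr:
        return f"{control_machine} resolves to {resolved}, not {control_addr}"
    return f"{control_machine} resolves to {control_addr}"
```

<p class="MsoNormal">Running check_controller("deda1x1466", "192.168.150.253") on the failing node and on a node where slurmd starts would show whether the two see the controller name differently.</p>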
<div>
<p class="MsoNormal">Regards<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Navin.<u></u><u></u></p>
</div>
</div>
<p class="MsoNormal"><u></u>&nbsp;<u></u></p>
<div>
<div>
<p class="MsoNormal">On Thu, Jun 11, 2020 at 6:44 PM Riebs, Andy &lt;<a href="mailto:andy.riebs@hpe.com" target="_blank">andy.riebs@hpe.com</a>&gt; wrote:<u></u><u></u></p>
</div>
<blockquote style="border-top:none;border-right:none;border-bottom:none;border-left:1pt solid rgb(204,204,204);padding:0in 0in 0in 6pt;margin-left:4.8pt;margin-right:0in">
<div>
<div>
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">If you omitted the “-D” that I suggested, then the daemon would have detached and logged nothing on
the screen. In this case, you can still go to the slurmd log (use “scontrol show config | grep -i log” if you’re not sure where the logs are stored).</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">&nbsp;</span><u></u><u></u></p>
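<p class="MsoNormal">The scontrol show config hint above can also be scripted. The following is a minimal sketch, assuming the output consists of "Key = Value" lines; log_settings and current_log_settings are hypothetical helper names. Unlike a case-insensitive grep over whole lines, it only looks for "log" in the key.</p>

```python
import subprocess


def log_settings(config_text):
    """Return {key: value} for each 'Key = Value' line whose key contains 'log'."""
    settings = {}
    for line in config_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and "log" in key.lower():
            settings[key.strip()] = value.strip()
    return settings


def current_log_settings():
    """Run `scontrol show config` on this host and filter its output."""
    out = subprocess.run(
        ["scontrol", "show", "config"],
        capture_output=True, text=True, check=True,
    ).stdout
    return log_settings(out)
```

<p class="MsoNormal">Calling current_log_settings() on a host where scontrol can reach the controller would list the configured log file locations.</p>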
<p class="MsoNormal"><b><span lang="EN-US" style="font-size:11pt;font-family:Calibri,sans-serif">From:</span></b><span lang="EN-US" style="font-size:11pt;font-family:Calibri,sans-serif"> slurm-users
[mailto:<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>]
<b>On Behalf Of </b>navin srivastava<br>
<b>Sent:</b> Thursday, June 11, 2020 9:01 AM<br>
<b>To:</b> Slurm User Community List &lt;<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a>&gt;<br>
<b>Subject:</b> Re: [slurm-users] unable to start slurmd process.</span><u></u><u></u></p>
<p class="MsoNormal">&nbsp;<u></u><u></u></p>
<div>
<p class="MsoNormal">I tried running it in debug mode, but it does not write anything there either.<u></u><u></u></p>
<div>
<p class="MsoNormal">&nbsp;<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">I waited for about 5-10 minutes.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">deda1x1452:/etc/sysconfig # /usr/sbin/slurmd -v -v<br>
<br>
No output on the terminal. <u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">The OS is SLES12-SP4. All firewall services are disabled.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">The one recent change is the hostname: earlier the nodes used local hostnames (node1, node2, etc.), but we have moved to DNS-based hostnames, which are the deda names.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">NodeName=node[1-12] NodeHostname=deda1x[1450-1461] NodeAddr=node[1-12] Sockets=2 CoresPerSocket=10 State=UNKNOWN<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Other than this, nothing has changed. Since that change I have started the slurmd process on the node several times and it worked fine, but today I am seeing this issue.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<u></u><u></u></p>
</div>
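<p class="MsoNormal">To make the NodeName/NodeHostname line above concrete, the bracketed ranges can be expanded and paired. This is a simplified sketch with hypothetical helpers (expand, name_map): it handles only the plain "prefix[first-last]" form, ignores zero padding and comma lists, and assumes the two ranges are paired position by position.</p>

```python
import re

# Handles only the simple "prefix[first-last]" form used in the config line.
RANGE_RE = re.compile(r"^(.*)\[(\d+)-(\d+)\]$")


def expand(expr):
    """Expand 'node[1-12]' into ['node1', ..., 'node12']."""
    m = RANGE_RE.match(expr)
    if not m:
        return [expr]  # plain name, nothing to expand
    prefix, first, last = m.group(1), int(m.group(2)), int(m.group(3))
    return [f"{prefix}{i}" for i in range(first, last + 1)]


def name_map(node_names, node_hostnames):
    """Pair each NodeName with a NodeHostname, position by position."""
    names, hosts = expand(node_names), expand(node_hostnames)
    if len(names) != len(hosts):
        raise ValueError("NodeName and NodeHostname ranges differ in length")
    return dict(zip(names, hosts))
```

<p class="MsoNormal">With the values from the config line, name_map("node[1-12]", "deda1x[1450-1461]") pairs node1 with deda1x1450 and node12 with deda1x1461. On a Slurm host, "scontrol show hostnames" performs the full expansion.</p>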
<div>
<p class="MsoNormal">Regards<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Navin.<u></u><u></u></p>
</div>
</div>
<p class="MsoNormal">&nbsp;<u></u><u></u></p>
<div>
<div>
<p class="MsoNormal">On Thu, Jun 11, 2020 at 6:06 PM Riebs, Andy &lt;<a href="mailto:andy.riebs@hpe.com" target="_blank">andy.riebs@hpe.com</a>&gt; wrote:<u></u><u></u></p>
</div>
<blockquote style="border-top:none;border-right:none;border-bottom:none;border-left:1pt solid rgb(204,204,204);padding:0in 0in 0in 6pt;margin:5pt 0in 5pt 4.8pt">
<div>
<div>
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">Navin,</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">&nbsp;</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">As you can see, systemd provides very little service-specific information. For slurm, you really need
to go to the slurm logs to find out what happened.</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">&nbsp;</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">Hint: A quick way to identify problems like this with slurmd and slurmctld is to run them with the
“-Dvvv” option, causing them to log to your window, and usually causing the problem to become immediately obvious.
</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">&nbsp;</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">For example,</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">&nbsp;</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)"># /usr/local/slurm/sbin/slurmd -Dvvvv</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">&nbsp;</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">Just hit ^C when you’re done, if necessary. Of course, if it doesn’t fail when you run it this way,
it’s time to look elsewhere.</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">&nbsp;</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">Andy</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">&nbsp;</span><u></u><u></u></p>
<p class="MsoNormal"><b><span lang="EN-US" style="font-size:11pt;font-family:Calibri,sans-serif">From:</span></b><span lang="EN-US" style="font-size:11pt;font-family:Calibri,sans-serif"> slurm-users
[mailto:<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>]
<b>On Behalf Of </b>navin srivastava<br>
<b>Sent:</b> Thursday, June 11, 2020 8:25 AM<br>
<b>To:</b> Slurm User Community List &lt;<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a>&gt;<br>
<b>Subject:</b> [slurm-users] unable to start slurmd process.</span><u></u><u></u></p>
<p class="MsoNormal">&nbsp;<u></u><u></u></p>
<div>
<p class="MsoNormal">Hi Team,<u></u><u></u></p>
<div>
<p class="MsoNormal">&nbsp;<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">When I try to start the slurmd process, I get the error below.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon...<br>
2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start operation timed out. Terminating.<br>
2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node daemon.<br>
2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit entered failed state.<br>
2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed with result 'timeout'.<br>
2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): session opened for user root by (uid=0)<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Slurm version is 17.11.8<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">The server and Slurm have been running for a long time and we have not made any changes, but today when I start slurmd it gives this error message. <u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Any idea what could be wrong here?<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Regards<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Navin.<u></u><u></u></p>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote></div>