<div dir="ltr"><div dir="ltr">Hi, <div><br><div>As an update, I was able to clear out the orphan/cancelled jobs by rebooting the compute nodes that had cancelled jobs. The error messages have ceased. </div><div><br></div><div>Regards,</div><div>Jeff</div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Dec 6, 2023 at 8:26 AM Jeffrey McDonald <<a href="mailto:jmcdonal@umn.edu">jmcdonal@umn.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi, <div>Yesterday, an upgrade of Slurm from 22.05.4 to 23.11.0 went sideways and I ended up losing a number of jobs on the compute nodes. Ultimately, the installation seems to have been successful, but I now appear to have some issues with job remnants. About once per minute (per job), the slurmctld daemon logs: </div><div><br></div><div><span style="font-family:monospace"><span style="color:rgb(0,0,0)">[2023-12-06T08:16:32.505] error: slurm_receive_msg [<a href="http://146.57.133.18:39104" target="_blank">146.57.133.18:39104</a>]: Zero Bytes were transmitted or received
</span><br>[2023-12-06T08:16:32.505] error: slurm_receive_msg [<a href="http://146.57.133.18:39106" target="_blank">146.57.133.18:39106</a>]: Zero Bytes were transmitted or received
<br>[2023-12-06T08:16:32.792] error: slurm_receive_msg [<a href="http://146.57.133.38:54722" target="_blank">146.57.133.38:54722</a>]: Zero Bytes were transmitted or received
<br>[2023-12-06T08:16:34.189] error: slurm_receive_msg [<a href="http://146.57.133.49:59058" target="_blank">146.57.133.49:59058</a>]: Zero Bytes were transmitted or received
<br>[2023-12-06T08:16:34.197] error: slurm_receive_msg [<a href="http://146.57.133.49:58232" target="_blank">146.57.133.49:58232</a>]: Zero Bytes were transmitted or received
<br>[2023-12-06T08:16:35.757] error: slurm_receive_msg [<a href="http://146.57.133.39:48856" target="_blank">146.57.133.39:48856</a>]: Zero Bytes were transmitted or received
<br>[2023-12-06T08:16:35.757] error: slurm_receive_msg [<a href="http://146.57.133.39:48860" target="_blank">146.57.133.39:48860</a>]: Zero Bytes were transmitted or received
<br>[2023-12-06T08:16:36.329] error: slurm_receive_msg [<a href="http://146.57.133.46:50848" target="_blank">146.57.133.46:50848</a>]: Zero Bytes were transmitted or received
<br>[2023-12-06T08:16:59.827] error: slurm_receive_msg [<a href="http://146.57.133.14:60328" target="_blank">146.57.133.14:60328</a>]: Zero Bytes were transmitted or received
<br>[2023-12-06T08:16:59.828] error: slurm_receive_msg [<a href="http://146.57.133.37:37734" target="_blank">146.57.133.37:37734</a>]: Zero Bytes were transmitted or received
<br>[2023-12-06T08:17:03.285] error: slurm_receive_msg [<a href="http://146.57.133.35:41426" target="_blank">146.57.133.35:41426</a>]: Zero Bytes were transmitted or received
<br>[2023-12-06T08:17:13.244] error: slurm_receive_msg [<a href="http://146.57.133.105:34416" target="_blank">146.57.133.105:34416</a>]: Zero Bytes were transmitted or received
<br>[2023-12-06T08:17:13.726] error: slurm_receive_msg [<a href="http://146.57.133.15:60164" target="_blank">146.57.133.15:60164</a>]: Zero Bytes were transmitted or received</span></div><div><span style="font-family:monospace"><br></span></div><div><span style="font-family:monospace">The controller also shows orphaned jobs: </span></div><div><span style="font-family:monospace"><br></span></div><div><span style="font-family:monospace"><span style="color:rgb(0,0,0)">[2023-12-06T07:47:42.010] error: </span><span style="font-weight:bold;color:rgb(255,84,84)">Orphan</span><span style="color:rgb(0,0,0)"> StepId=9050.extern reported on node amd03
</span><br>[2023-12-06T07:47:42.010] error: <span style="font-weight:bold;color:rgb(255,84,84)">Orphan</span><span style="color:rgb(0,0,0)"> StepId=9055.extern reported on node amd03
</span><br>[2023-12-06T07:47:42.011] error: <span style="font-weight:bold;color:rgb(255,84,84)">Orphan</span><span style="color:rgb(0,0,0)"> StepId=8862.extern reported on node amd12
</span><br>[2023-12-06T07:47:42.011] error: <span style="font-weight:bold;color:rgb(255,84,84)">Orphan</span><span style="color:rgb(0,0,0)"> StepId=9065.extern reported on node amd07
</span><br>[2023-12-06T07:47:42.011] error: <span style="font-weight:bold;color:rgb(255,84,84)">Orphan</span><span style="color:rgb(0,0,0)"> StepId=9066.extern reported on node amd07
</span><br>[2023-12-06T07:47:42.011] error: <span style="font-weight:bold;color:rgb(255,84,84)">Orphan</span><span style="color:rgb(0,0,0)"> StepId=8987.extern reported on node amd09
</span><br>[2023-12-06T07:47:42.012] error: <span style="font-weight:bold;color:rgb(255,84,84)">Orphan</span><span style="color:rgb(0,0,0)"> StepId=9068.extern reported on node amd08
</span><br>[2023-12-06T07:47:42.012] error: <span style="font-weight:bold;color:rgb(255,84,84)">Orphan</span><span style="color:rgb(0,0,0)"> StepId=8862.extern reported on node amd13
</span><br>[2023-12-06T07:47:42.012] error: <span style="font-weight:bold;color:rgb(255,84,84)">Orphan</span><span style="color:rgb(0,0,0)"> StepId=8774.extern reported on node amd10
</span><br>[2023-12-06T07:47:42.012] error: <span style="font-weight:bold;color:rgb(255,84,84)">Orphan</span><span style="color:rgb(0,0,0)"> StepId=9051.extern reported on node amd10
</span><br>[2023-12-06T07:49:22.009] error: <span style="font-weight:bold;color:rgb(255,84,84)">Orphan</span><span style="color:rgb(0,0,0)"> StepId=9071.extern reported on node aslab01
</span><br>[2023-12-06T07:49:22.010] error: <span style="font-weight:bold;color:rgb(255,84,84)">Orphan</span><span style="color:rgb(0,0,0)"> StepId=8699.extern reported on node gpu05</span><br>
<br></span></div><div><span style="font-family:monospace"><br></span></div><div><span style="font-family:monospace">On the compute nodes, I see  a corresponding error message like this one: </span></div><div><span style="font-family:monospace"><br></span></div><div><span style="font-family:monospace"><span style="color:rgb(0,0,0)">[2023-12-06T08:18:03.292] [9052.extern] error: hash_g_compute: hash plugin with id:0 not exist or is not loaded
</span><br>[2023-12-06T08:18:03.292] [9052.extern] error: slurm_send_node_msg: hash_g_compute: REQUEST_STEP_COMPLETE has error<br>
<br></span></div><div><br></div><div><span style="font-family:monospace"><br></span></div><div><span style="font-family:monospace">The error always seems to refer to a job that was canceled, e.g. 9052: </span></div><div><span style="font-family:monospace"><br></span></div><div><span style="font-family:monospace"><span style="color:rgb(0,0,0)"># sacct -j 9052
</span><br>JobID           JobName  Partition    Account  AllocCPUS      State ExitCode  <br>------------ ---------- ---------- ---------- ---------- ---------- --------  <br>9052         sys/dashb+     a40gpu                    24  CANCELLED      0:0  <br>9052.batch        batch                               24  CANCELLED      0:0  <br>9052.extern      extern                               24  CANCELLED      0:0 <br>
<br></span>These jobs were running at the start of the update, but we subsequently canceled them because of the slurmd or slurmctld timeouts during the update. How can I clean this up? I've tried canceling the jobs, but nothing seems to remove them. </div><div><br></div><div>Thanks in advance,</div><div>Jeff</div><div><div><br></div><div dir="ltr" class="gmail_signature"><div dir="ltr"></div></div></div></div>
</blockquote></div><div><br></div><div dir="ltr" class="gmail_signature"><div dir="ltr"></div></div></div>