<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:"Times New Roman \(Body CS\)";
panose-1:2 11 6 4 2 2 2 2 2 4;}
@font-face
{font-family:Menlo;
panose-1:2 11 6 9 3 8 4 2 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:12.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri",sans-serif;
color:windowtext;
font-weight:normal;
font-style:normal;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:12.0pt;
font-family:"Calibri",sans-serif;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt">Hello,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">We are using Slurm 22.05.6 and have encountered a strange issue with one user's jobs, which were submitted as a job array. According to the logs, the jobs failed and left the queue, but they have continued to accrue CPU minutes well
past job completion. I am using one step as an example here, but this is occurring for all the steps within the job array.
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Below is a snippet from the slurmctld log for one of the job steps in question:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">[2023-01-25T08:36:40.299] sched/backfill: _start_job: Started JobId=8853669_3(8853785) in &lt;partition&gt; on &lt;node&gt;<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">[2023-01-25T08:36:40.599] _job_complete: JobId=8853669_3(8853785) WEXITSTATUS 1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">[2023-01-25T08:36:40.601] _job_complete: JobId=8853669_3(8853785) done<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">However, when checking the job with sacct, I see that the end time is Unknown and the job shows as never having completed.
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"># sacct -j 8853669_3 --format=start%15,end%15,elapsed%20,state%15<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> Start End Elapsed State <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">--------------- --------------- -------------------- --------------- <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">2023-01-25T08:3 Unknown 9-01:22:21 FAILED <o:p></o:p></span></p>
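<p class="MsoNormal"><span style="font-size:11.0pt">To put the Elapsed value in perspective, here is a minimal Python sketch that converts it to seconds. It assumes the D-HH:MM:SS form shown in the sacct output above; the helper name parse_slurm_elapsed is made up for illustration and is not part of Slurm.<o:p></o:p></span></p>
<pre>
```python
def parse_slurm_elapsed(text):
    """Convert a Slurm-style [D-]HH:MM:SS duration string to whole seconds."""
    # Split off an optional leading day count ("9-01:22:21" gives "9" and "01:22:21").
    days, _, clock = text.strip().rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


# Elapsed value copied from the sacct output above.
elapsed_seconds = parse_slurm_elapsed("9-01:22:21")
print(elapsed_seconds)  # 782541
```
</pre>
<p class="MsoNormal"><span style="font-size:11.0pt">That is a little over nine days of accrued time for a job that slurmctld logged as done within the same second it started.<o:p></o:p></span></p>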
<p class="MsoNormal"><span style="font-size:11.0pt"> <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">One curious detail is that the job ID does not appear in the logs of the node where it is said to have run.
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Running scancel on the job has no effect, and we see the following in the logs when attempting to do so:
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">[2023-02-03T08:44:36.072] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=8853669_3 uid &lt;id&gt;<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">[2023-02-03T08:44:36.073] job_str_signal(5): invalid JobId=8853669_3<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">[2023-02-03T08:44:36.073] _slurm_rpc_kill_job: job_str_signal() uid=&lt;id&gt; JobId=8853669_3 sig=9 returned: Invalid job id specified<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">In the database, everything looks correct for the job:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">> select time_start,time_end from job_table where id_job="8853669_3";<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+------------+------------+<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">| time_start | time_end |<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+------------+------------+<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">| 1674653930 | 1674653931 |<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+------------+------------+<o:p></o:p></span></p>
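<p class="MsoNormal"><span style="font-size:11.0pt">For comparison with the log timestamps, here is a small Python sketch that converts the two values to readable times. It assumes time_start and time_end are Unix epoch seconds and prints them in UTC; the helper name epoch_to_utc is made up for illustration, and lining the result up with the local-time log entries would still need the site's UTC offset.<o:p></o:p></span></p>
<pre>
```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Render Unix epoch seconds as an ISO 8601 timestamp in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()


# Values copied from the job_table query above.
time_start, time_end = 1674653930, 1674653931

print(epoch_to_utc(time_start))  # 2023-01-25T13:38:50+00:00
print(epoch_to_utc(time_end))    # 2023-01-25T13:38:51+00:00
print(time_end - time_start)     # 1 (recorded runtime in seconds)
```
</pre>
<p class="MsoNormal"><span style="font-size:11.0pt">So the database row records a one-second runtime with a populated end time, unlike the Unknown end time that sacct reports.<o:p></o:p></span></p>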
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Both slurmctld and slurmdbd are running, so I am at a bit of a loss as to how to get the controller to treat this job as ended so that it stops accruing CPU minutes.
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Any help would be appreciated, thanks!</span><o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</body>
</html>