[slurm-users] Job continuing to use cpu minutes after completion

Fri Feb 3 18:08:12 UTC 2023

This sounds similar to something I recently experienced and finally figured out in 21.08.

https://lists.schedmd.com/pipermail/slurm-users/2023-January/009594.html

The long and short of it is that I had jobs with the clock still running even though they weren’t showing up in squeue, etc.
I ended up requeueing the jobs, and then cancelling them, and they finally fell off the ledger.

Hope that's helpful,
Reed 

> On Feb 3, 2023, at 9:17 AM, Jonathan Casco <jcasco at fiu.edu> wrote:
> 
> Hello,
>  
> We are using Slurm 22.05.6 and have encountered a strange issue with one user's jobs, which were submitted as a job array. According to the logs, the jobs failed and left the queue, but they have continued to use CPU minutes well past job completion. I am using one step as an example here, but this is occurring for all the steps within the job array.
>  
> Below is a snippet from the slurmctld log for one of the job steps in question:
> [2023-01-25T08:36:40.299] sched/backfill: _start_job: Started JobId=8853669_3(8853785) in <partition> on <node>
> [2023-01-25T08:36:40.599] _job_complete: JobId=8853669_3(8853785) WEXITSTATUS 1
> [2023-01-25T08:36:40.601] _job_complete: JobId=8853669_3(8853785) done
>  
> However, when checking the job with sacct, I see that the end time is Unknown and the job shows as never completed.
> # sacct -j 8853669_3 --format=start%15,end%15,elapsed%20,state%15
>           Start             End              Elapsed           State 
> --------------- --------------- -------------------- --------------- 
> 2023-01-25T08:3         Unknown           9-01:22:21          FAILED 
>  
> One curious bit in this is that the job ID does not appear in the logs of the node where it is said to have run.
>  
> An scancel of the job does not have an effect and we see the following in the logs when attempting to do so:
> [2023-02-03T08:44:36.072] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=8853669_3 uid <id>
> [2023-02-03T08:44:36.073] job_str_signal(5): invalid JobId=8853669_3
> [2023-02-03T08:44:36.073] _slurm_rpc_kill_job: job_str_signal() uid=<id> JobId=8853669_3 sig=9 returned: Invalid job id specified
>  
> Checking the database, everything looks correct there for the job.
> > select time_start,time_end from job_table where id_job="8853669_3";
> +------------+------------+
> | time_start | time_end   |
> +------------+------------+
> | 1674653930 | 1674653931 |
> +------------+------------+
>  
> Both slurmctld and slurmdbd are running, so I am at a bit of a loss on how to get this job to “end” as far as the controller is concerned, so that it stops consuming CPU minutes.
>  
> Any help would be appreciated, thanks!
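
For anyone reading the time_start and time_end values in the quoted query: they are Unix epoch seconds, so they are easier to compare against the log and sacct output once converted. Below is a minimal Python sketch of that conversion. The helper name epoch_to_utc is hypothetical (it is not part of Slurm), and the output is in UTC, so it will be offset from whatever local time zone the slurmctld log uses.

```python
from datetime import datetime, timezone


def epoch_to_utc(ts: int) -> str:
    """Convert Unix epoch seconds to an ISO 8601 UTC timestamp string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


# Values copied from the quoted job_table query.
time_start = 1674653930
time_end = 1674653931

print("time_start:", epoch_to_utc(time_start))
print("time_end:  ", epoch_to_utc(time_end))
# Number of seconds between the recorded start and end.
print("duration (s):", time_end - time_start)
```

The two values are one second apart, so the row returned by that query records a start and an end, in contrast to the "Unknown" end and multi-day Elapsed reported by sacct.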
