[slurm-users] Job continuing to use cpu minutes after completion

Fri Feb 3 21:33:58 UTC 2023

Hi Reed,

Thank you for that information. I gave the requeue a try; however, it did not work, as the scheduler did not recognize the job ID.
# scontrol requeue 8853669_3
8853669_3: Invalid job id specified

I tried with a few other job steps but saw the same error. It looks like the scheduler and the database disagree about this batch of jobs, which is odd. Unfortunately, restarting the daemons did not do the trick either.

From: Reed Dier <reed.dier at focusvq.com>
Date: Friday, February 3, 2023 at 1:08 PM
To: Jonathan Casco <jcasco at fiu.edu>
Cc: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Job continuing to use cpu minutes after completion

This sounds similar to something I recently experienced and finally figured out in 21.08.

https://lists.schedmd.com/pipermail/slurm-users/2023-January/009594.html

The long and short of it is that I had jobs with the clock still running, even though they weren’t showing up in squeue, etc.
I ended up requeueing the jobs and then cancelling them, and they finally fell off the ledger.

Hope that's helpful,
Reed

On Feb 3, 2023, at 9:17 AM, Jonathan Casco <jcasco at fiu.edu> wrote:

Hello,

We are using Slurm 22.05.6 and have encountered a strange issue with one user's jobs, which were submitted as a job array. According to the logs, the jobs failed and left the queue, but they have continued to use CPU minutes well past job completion. I am using one step as an example here, but this is occurring for all the steps within the job array.

Below is a snippet from the slurmctld log for one of the job steps in question:
[2023-01-25T08:36:40.299] sched/backfill: _start_job: Started JobId=8853669_3(8853785) in <partition> on <node>
[2023-01-25T08:36:40.599] _job_complete: JobId=8853669_3(8853785) WEXITSTATUS 1
[2023-01-25T08:36:40.601] _job_complete: JobId=8853669_3(8853785) done

However, when checking the job with sacct, I see that the end time is Unknown and the job shows as never completed.
# sacct -j 8853669_3 --format=start%15,end%15,elapsed%20,state%15
          Start             End              Elapsed           State
--------------- --------------- -------------------- ---------------
2023-01-25T08:3         Unknown           9-01:22:21          FAILED
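
For a sense of scale, the Elapsed value above can be converted to seconds and set against the roughly 0.3 s between the _start_job and _job_complete "done" entries in the slurmctld log. This is only a minimal Python sketch: parse_slurm_elapsed is a hypothetical helper, not part of Slurm, and it assumes the [D-]HH:MM:SS layout shown in the sacct output.

```python
from datetime import timedelta


def parse_slurm_elapsed(text: str) -> timedelta:
    """Parse an elapsed string such as '9-01:22:21' (assumed layout: [D-]HH:MM:SS)."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


# Value reported by sacct in the message above.
elapsed = parse_slurm_elapsed("9-01:22:21")

# Runtime according to the slurmctld log: 08:36:40.299 (started) to 08:36:40.601 (done).
logged_runtime = timedelta(milliseconds=302)

print(elapsed.total_seconds())         # seconds currently shown as elapsed
print(logged_runtime.total_seconds())  # seconds the job ran according to the log
```

The gap between the two numbers is what one would expect if the accounting view still treats the record as open, so that Elapsed keeps growing each time sacct is run.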

One curious bit in this is that the job ID does not appear in the logs of the node where it is said to have run.

Running scancel on the job has no effect, and we see the following in the logs when attempting it:
[2023-02-03T08:44:36.072] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=8853669_3 uid <id>
[2023-02-03T08:44:36.073] job_str_signal(5): invalid JobId=8853669_3
[2023-02-03T08:44:36.073] _slurm_rpc_kill_job: job_str_signal() uid=<id> JobId=8853669_3 sig=9 returned: Invalid job id specified

Checking the database, everything looks correct there for the job.
> select time_start,time_end from job_table where id_job="8853669_3";
+------------+------------+
| time_start | time_end   |
+------------+------------+
| 1674653930 | 1674653931 |
+------------+------------+
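
The time_start and time_end values above are Unix epoch seconds. As a quick sanity check, they can be converted to readable UTC timestamps with a few lines of Python. This is a sketch using only the two values from the query output; the variable names are chosen here for illustration.

```python
from datetime import datetime, timezone

# Values copied from the job_table query output above.
time_start = 1674653930
time_end = 1674653931

start = datetime.fromtimestamp(time_start, tz=timezone.utc)
end = datetime.fromtimestamp(time_end, tz=timezone.utc)

print(start.isoformat())      # 2023-01-25T13:38:50+00:00
print(end.isoformat())        # 2023-01-25T13:38:51+00:00
print(time_end - time_start)  # 1 (second between start and end)
```

The output is in UTC, whereas the slurmctld log timestamps are in the controller's local time, whose zone is not stated in the thread. Note also that the slurmctld log lists this array task as JobId=8853669_3(8853785), so when comparing database rows against the log it may be worth checking which of those two IDs a given row actually corresponds to.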

Both slurmctld and slurmdbd are running, so I am at a bit of a loss as to how to get this job to “end” from the controller's point of view so that it stops consuming CPU minutes.

Any help would be appreciated, thanks!
