[slurm-users] Job continuing to use cpu minutes after completion

Jonathan Casco jcasco at fiu.edu
Fri Feb 3 15:17:14 UTC 2023


Hello,

We are running Slurm 22.05.6 and have hit a strange issue with one user's jobs after they submitted a job array. According to the logs the jobs failed and left the queue, but they have continued to accrue CPU minutes well past the job completion. I am using one array task as an example here, but this is happening for every task in the array.

Below is a snippet from the slurmctld log for one of the job steps in question:
[2023-01-25T08:36:40.299] sched/backfill: _start_job: Started JobId=8853669_3(8853785) in <partition> on <node>
[2023-01-25T08:36:40.599] _job_complete: JobId=8853669_3(8853785) WEXITSTATUS 1
[2023-01-25T08:36:40.601] _job_complete: JobId=8853669_3(8853785) done

However, when checking the job with sacct, I see that the end time is Unknown and the job shows as never having completed.
# sacct -j 8853669_3 --format=start%15,end%15,elapsed%20,state%15
          Start             End              Elapsed           State
--------------- --------------- -------------------- ---------------
2023-01-25T08:3         Unknown           9-01:22:21          FAILED

One curious detail is that the job ID does not appear in the slurmd logs on the node where it is said to have run.
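
That check was just a grep of the slurmd log on that node for both the array job ID and the raw job ID; the path below is only the usual default, SlurmdLogFile may point elsewhere:
# grep -E '8853669|8853785' /var/log/slurmd.log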

Running scancel against the job has no effect, and we see the following in the slurmctld log when we try:
[2023-02-03T08:44:36.072] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=8853669_3 uid <id>
[2023-02-03T08:44:36.073] job_str_signal(5): invalid JobId=8853669_3
[2023-02-03T08:44:36.073] _slurm_rpc_kill_job: job_str_signal() uid=<id> JobId=8853669_3 sig=9 returned: Invalid job id specified
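
For reference, that kill request is of this form (the log shows sig=9, i.e. a plain scancel or scancel --signal=KILL):
# scancel 8853669_3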

Checking the database, everything looks correct there for the job.
> select time_start,time_end from job_table where id_job="8853669_3";
+------------+------------+
| time_start | time_end   |
+------------+------------+
| 1674653930 | 1674653931 |
+------------+------------+
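
The two epoch values are one second apart, which is consistent with the near-instant failure in the slurmctld log; they can be sanity-checked with GNU date (the output timezone depends on the local TZ setting):
# date -d @1674653930
# date -d @1674653931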

Both slurmctld and slurmdbd are running, so I am at a bit of a loss as to how to get the controller to treat this job as "ended" so that it stops consuming CPU minutes.

Any help would be appreciated, thanks!
