[slurm-users] Job continuing to use cpu minutes after completion
Jonathan Casco
jcasco at fiu.edu
Fri Feb 3 15:17:14 UTC 2023
Hello,
We are using Slurm 22.05.6 and have encountered a strange issue with one user's jobs, which were submitted as a job array. According to the logs, the jobs failed and left the queue, but they have continued to accrue CPU minutes well past job completion. I am using one array task as an example here, but this is occurring for all the tasks within the job array.
Below is a snippet from the slurmctld log for one of the job steps in question:
[2023-01-25T08:36:40.299] sched/backfill: _start_job: Started JobId=8853669_3(8853785) in <partition> on <node>
[2023-01-25T08:36:40.599] _job_complete: JobId=8853669_3(8853785) WEXITSTATUS 1
[2023-01-25T08:36:40.601] _job_complete: JobId=8853669_3(8853785) done
However, when checking the job with sacct, I see that the end time is Unknown, as if the job never completed.
# sacct -j 8853669_3 --format=start%15,end%15,elapsed%20,state%15
          Start             End              Elapsed           State
--------------- --------------- -------------------- ---------------
2023-01-25T08:3         Unknown           9-01:22:21          FAILED
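To put the Elapsed value in context, here is a small Python sketch that parses the elapsed string and adds it to the start time from the slurmctld log. The helper name `parse_elapsed` is made up for illustration, it only handles the `[D-]HH:MM:SS` form shown above, and the start timestamp is assumed to be local time as logged. The result lands on the morning of 2023-02-03, which suggests sacct is reporting the current time minus the start time rather than a recorded end time.

```python
from datetime import datetime, timedelta


def parse_elapsed(text: str) -> timedelta:
    """Parse an elapsed string of the form [D-]HH:MM:SS (hypothetical helper)."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


# Start time taken from the slurmctld log line (assumed local time, no timezone).
start = datetime(2023, 1, 25, 8, 36, 40)
elapsed = parse_elapsed("9-01:22:21")
implied_now = start + elapsed

print(int(elapsed.total_seconds()))  # 782541
print(implied_now.isoformat())       # 2023-02-03T09:59:01
```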
One curious bit in this is that the job ID does not appear in the logs of the node where it is said to have run.
Running scancel on the job has no effect, and we see the following in the logs when attempting it:
[2023-02-03T08:44:36.072] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=8853669_3 uid <id>
[2023-02-03T08:44:36.073] job_str_signal(5): invalid JobId=8853669_3
[2023-02-03T08:44:36.073] _slurm_rpc_kill_job: job_str_signal() uid=<id> JobId=8853669_3 sig=9 returned: Invalid job id specified
Checking the database, everything looks correct for the job:
> select time_start,time_end from job_table where id_job="8853669_3";
+------------+------------+
| time_start | time_end   |
+------------+------------+
| 1674653930 | 1674653931 |
+------------+------------+
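The two values are Unix epoch seconds. A quick Python sketch, independent of any Slurm tooling and with purely illustrative variable names, converts them to UTC. It shows that the database holds a non-zero end time one second after the start, so the job appears to be recorded as finished there.

```python
from datetime import datetime, timezone

# Values copied from the query result above.
time_start = 1674653930
time_end = 1674653931

start_utc = datetime.fromtimestamp(time_start, tz=timezone.utc)
end_utc = datetime.fromtimestamp(time_end, tz=timezone.utc)

print(start_utc.isoformat())  # 2023-01-25T13:38:50+00:00
print(end_utc.isoformat())    # 2023-01-25T13:38:51+00:00
print(time_end - time_start)  # 1
```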
Both slurmctld and slurmdbd are running, so I am at a bit of a loss as to how to get this job to “end” as far as the controller is concerned, so that it stops consuming CPU minutes.
Any help would be appreciated, thanks!