Hello,
I replicated this issue on a different cluster and determined the root cause: the `time_eligible` column in the underlying MySQL database gets set to 0 when a running job is held. Let me demonstrate.
1. Allocate a job and check that I can query it via `sacct -S YYYY-MM-DD`:
jess@bcm10-h01:~$ srun --pty bash
jess@bcm10-n01:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES CPUS MIN_M
               114      defq     bash     jess  R       1:13      1    1 2900M
root@bcm10-h01:~# sacct -S 2026-01-06 -a
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
114                bash       defq   allusers          1    RUNNING      0:0
114.0              bash              allusers          1    RUNNING      0:0
root@bcm10-h01:~# scontrol show jobid=114 | grep EligibleTime
   SubmitTime=2026-01-06T14:52:04 *EligibleTime=2026-01-06T14:52:04*
2. Hold and release the job, confirm that it is no longer queryable via `sacct -S YYYY-MM-DD`, and notice that EligibleTime changes to Unknown.
jess@bcm10-n01:~$ scontrol hold 114
jess@bcm10-n01:~$ scontrol release 114
root@bcm10-h01:~# sacct -S 2026-01-06 -a
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
root@bcm10-h01:~# scontrol show jobid=114 | grep EligibleTime
   SubmitTime=2026-01-06T14:52:04 *EligibleTime=Unknown*
3. Check `time_eligible` in the underlying MySQL database and confirm that changing it makes the job queryable via `sacct -S YYYY-MM-DD` again.
root@bcm10-h01:~# mysql --host=localhost --user=slurm --password=XYZ slurm_acct_db
mysql> SELECT id_job FROM slurm_job_table WHERE time_eligible = 0;
+--------+
| id_job |
+--------+
|  *114* |
|    112 |
|    113 |
+--------+
3 rows in set (0.00 sec)
mysql> UPDATE slurm_job_table SET time_eligible = 1767733491 WHERE id_job = 114;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0
mysql> SELECT time_eligible FROM slurm_job_table WHERE id_job = 114;
+---------------+
| time_eligible |
+---------------+
|    1767733491 |
+---------------+
1 row in set (0.00 sec)
### WORKS AGAIN
root@bcm10-h01:~# sacct -S 2026-01-06 -a
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
114                bash       defq   allusers          1    RUNNING      0:0
114.0              bash              allusers          1    RUNNING      0:0
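For anyone else cleaning up after this: a possible bulk repair, under the same assumptions as the session above (a cluster named slurm, hence the `slurm_job_table` name, and the same credentials), is to copy `time_start` into `time_eligible` for every affected job that actually started; the `time_start > 0` filter should leave genuinely pending held jobs untouched. This is only a sketch, so back up slurm_acct_db before editing it by hand:

# Sketch of a bulk repair; same assumptions as the session above.
# Back up slurm_acct_db first -- this edits accounting records directly.
mysql --host=localhost --user=slurm --password=XYZ slurm_acct_db -e "
    UPDATE slurm_job_table
       SET time_eligible = time_start  -- repair from the recorded start time
     WHERE time_eligible = 0
       AND time_start > 0;             -- only jobs that actually started
"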
4. The man page for sacct says things like:
"For example jobs submitted with the "--hold" option will have "EligibleTime=Unknown" as they are pending indefinitely."
*Conclusion:* This very much feels like a *bug*. A running job shouldn't be 'holdable' in the first place: it cannot pend indefinitely, because it is already actively running. And even if holding is allowed, the EligibleTime should not change when a user tries to 'hold' a running job.
*Question:* Identifying these problematic jobs via the underlying MySQL database seems suboptimal. Are there any better workarounds?
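The best stopgap I have found so far, in case it helps anyone: while the hold is in place, held running jobs show up in squeue with a Reason of JobHeldUser (or JobHeldAdmin), as in the quoted output below, so their ids can be collected and passed to sacct explicitly with `-j`. A minimal sketch, assuming the Reason prefix holds and that this runs before the jobs are released (Reason reverts to None afterwards):

# Sketch: collect ids of running-but-held jobs via squeue and query them
# explicitly, since `sacct -a -S` misses them; assumes Reason=JobHeld*.
held=$(squeue -h -t RUNNING -o "%A %r" | awk '$2 ~ /^JobHeld/ {print $1}' | paste -sd, -)
if [ -n "$held" ]; then
    sacct -j "$held" -S 2026-01-06 --format="jobidraw,jobname"
fi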
Best regards, Lee
On Mon, Dec 15, 2025 at 2:33 PM Lee leewithemily@gmail.com wrote:
Hello,
I am using Slurm 23.02.6 and have run into a strange issue. I periodically use sacct to dump job data, then generate reports based on the resource allocation of our users.
Recently, I noticed some jobs were 'missing' from my query. The missing jobs came from a user with a large array job who 'held' all of the array tasks, including the ones that were already Running. Now, if I run `sacct -a -S YYYY-MM-DD --format="jobidraw,jobname"`, those jobs are missing from the output.
However, if I query specifically for such a job, i.e. `sacct -j RAWJOBID -S YYYY-MM-DD --format="jobidraw,jobname"`, the job is present.
*Question:*
- How can I include the 'held' running jobs in my bulk query with `sacct -a`? Finding these outliers and adding them ad hoc to my dumped file is too laborious and isn't feasible.
*Minimum working example:*

#. Submit a job:
myuser@clusterb01:~$ srun --pty bash   # landed on dgx29
#. Hold job:
myuser@clusterb01:~$ scontrol hold 120918
myuser@clusterb01:~$ scontrol show job=120918
JobId=120918 JobName=bash
   UserId=myuser(123456) GroupId=rdusers(7000) MCS_label=N/A
   Priority=0 Nice=0 Account=allusers QOS=normal
   JobState=*RUNNING* Reason=*JobHeldUser* Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:29 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2025-12-15T13:31:28 EligibleTime=Unknown
   AccrueTime=Unknown
   StartTime=2025-12-15T13:31:28 EndTime=2025-12-22T13:31:28 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-12-15T13:31:28 Scheduler=Main
   Partition=defq AllocNode:Sid=clusterb01:4145861
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=dgx29
   BatchHost=dgx29
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=9070M,node=1,billing=1
   AllocTRES=cpu=2,mem=18140M,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=9070M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/home/myuser
   Power=
#. Release job:
myuser@clusterb01:~$ scontrol release 120918

#. Show job again:
myuser@clusterb01:~$ scontrol show job=120918
JobId=120918 JobName=bash
   UserId=myuser(123456) GroupId=rdusers(7000) MCS_label=N/A
   Priority=1741 Nice=0 Account=allusers QOS=normal
   JobState=*RUNNING* Reason=*None* Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:01:39 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2025-12-15T13:31:28 EligibleTime=Unknown
   AccrueTime=Unknown
   StartTime=2025-12-15T13:31:28 EndTime=2025-12-22T13:31:28 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-12-15T13:31:28 Scheduler=Main
   Partition=defq AllocNode:Sid=clusterb01:4145861
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=dgx29
   BatchHost=dgx29
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=9070M,node=1,billing=1
   AllocTRES=cpu=2,mem=18140M,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=9070M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/home/myuser/
   Power=
#. In slurmctld, I see:
root@clusterb01:~# grep 120918 /var/log/slurmctld
[2025-12-15T13:31:28.706] sched: _slurm_rpc_allocate_resources JobId=120918 NodeList=dgx29 usec=1269
[2025-12-15T13:31:47.751] sched: _hold_job_rec: hold on JobId=120918 by uid 123456
[2025-12-15T13:31:47.751] sched: _update_job: set priority to 0 for JobId=120918
[2025-12-15T13:31:47.751] _slurm_rpc_update_job: complete JobId=120918 uid=123456 usec=189
[2025-12-15T13:32:48.081] sched: _release_job_rec: release hold on JobId=120918 by uid 123456
[2025-12-15T13:32:48.081] _slurm_rpc_update_job: complete JobId=120918 uid=123456 usec=268
[2025-12-15T13:33:20.552] _job_complete: JobId=120918 WEXITSTATUS 0
[2025-12-15T13:33:20.552] _job_complete: JobId=120918 done
#. Job is NOT missing when identifying it by jobid:
myuser@clusterb01:~$ sacct -j 120918 --starttime=2025-12-12 -o "jobidraw,jobid,node,start,end,elapsed,state,submitline%30"
JobIDRaw     JobID        NodeList        Start               End                 Elapsed    State                          SubmitLine
------------ ------------ --------------- ------------------- ------------------- ---------- ---------- ------------------------------
120918       120918       dgx29           2025-12-15T13:31:28 2025-12-15T13:33:20 00:01:52   COMPLETED             srun --pty bash
120918.0     120918.0     dgx29           2025-12-15T13:31:28 2025-12-15T13:33:20 00:01:52   COMPLETED             srun --pty bash
#. Job IS *missing* when using -a:
myuser@clusterb01:~$ sacct -a --starttime=2025-12-12 -o "jobidraw,jobid,node,start,end,elapsed,state,submitline%30" | grep -i 120918
## *MISSING*
Best regards, Lee