Hello,
I am using slurm 23.02.6. I have a strange issue. I periodically use sacct to dump job data. I then generate reports based on the resource allocation of our users.
Recently, I noticed some 'missing' jobs in my query. The missing jobs came from a user who had a large array job and then 'held' all of the array jobs, including the ones that were already Running. Now, if I run `sacct -a -S YYYY-MM-DD --format="jobidraw,jobname"`, those jobs are missing from that query.
However, if I query specifically for such a job, i.e. `sacct -j RAWJOBID -S YYYY-MM-DD --format="jobidraw,jobname"`, the job is present.
*Question*: 1. How can I include these 'held' running jobs when I do my bulk query with `sacct -a`? Finding the outliers and adding them to my dumped file ad hoc is too laborious to be feasible.
*Minimum working example*:

1. Submit a job:

myuser@clusterb01:~$ srun --pty bash    # landed on dgx29

2. Hold the job:

myuser@clusterb01:~$ scontrol hold 120918
myuser@clusterb01:~$ scontrol show job=120918
JobId=120918 JobName=bash
   UserId=myuser(123456) GroupId=rdusers(7000) MCS_label=N/A
   Priority=0 Nice=0 Account=allusers QOS=normal
   JobState=*RUNNING* Reason=*JobHeldUser* Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:29 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2025-12-15T13:31:28 EligibleTime=Unknown
   AccrueTime=Unknown
   StartTime=2025-12-15T13:31:28 EndTime=2025-12-22T13:31:28 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-12-15T13:31:28 Scheduler=Main
   Partition=defq AllocNode:Sid=clusterb01:4145861
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=dgx29
   BatchHost=dgx29
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=9070M,node=1,billing=1
   AllocTRES=cpu=2,mem=18140M,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=9070M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/home/myuser
   Power=

3. Release the job:

myuser@clusterb01:~$ scontrol release 120918

4. Show the job again:

myuser@clusterb01:~$ scontrol show job=120918
JobId=120918 JobName=bash
   UserId=myuser(123456) GroupId=rdusers(7000) MCS_label=N/A
   Priority=1741 Nice=0 Account=allusers QOS=normal
   JobState=*RUNNING* Reason=*None* Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:01:39 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2025-12-15T13:31:28 EligibleTime=Unknown
   AccrueTime=Unknown
   StartTime=2025-12-15T13:31:28 EndTime=2025-12-22T13:31:28 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-12-15T13:31:28 Scheduler=Main
   Partition=defq AllocNode:Sid=clusterb01:4145861
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=dgx29
   BatchHost=dgx29
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=9070M,node=1,billing=1
   AllocTRES=cpu=2,mem=18140M,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=9070M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/home/myuser/
   Power=

5. In slurmctld, I see:

root@clusterb01:~# grep 120918 /var/log/slurmctld
[2025-12-15T13:31:28.706] sched: _slurm_rpc_allocate_resources JobId=120918 NodeList=dgx29 usec=1269
[2025-12-15T13:31:47.751] sched: _hold_job_rec: hold on JobId=120918 by uid 123456
[2025-12-15T13:31:47.751] sched: _update_job: set priority to 0 for JobId=120918
[2025-12-15T13:31:47.751] _slurm_rpc_update_job: complete JobId=120918 uid=123456 usec=189
[2025-12-15T13:32:48.081] sched: _release_job_rec: release hold on JobId=120918 by uid 123456
[2025-12-15T13:32:48.081] _slurm_rpc_update_job: complete JobId=120918 uid=123456 usec=268
[2025-12-15T13:33:20.552] _job_complete: JobId=120918 WEXITSTATUS 0
[2025-12-15T13:33:20.552] _job_complete: JobId=120918 done

6. The job is NOT missing when identifying it by jobid:

myuser@clusterb01:~$ sacct -j 120918 --starttime=2025-12-12 -o "jobidraw,jobid,node,start,end,elapsed,state,submitline%30"
JobIDRaw     JobID        NodeList        Start               End                 Elapsed    State      SubmitLine
------------ ------------ --------------- ------------------- ------------------- ---------- ---------- ------------------------------
120918       120918       dgx29           2025-12-15T13:31:28 2025-12-15T13:33:20 00:01:52   COMPLETED  srun --pty bash
120918.0     120918.0     dgx29           2025-12-15T13:31:28 2025-12-15T13:33:20 00:01:52   COMPLETED  srun --pty bash

7. The job IS *missing* when using -a:

myuser@clusterb01:~$ sacct -a --starttime=2025-12-12 -o "jobidraw,jobid,node,start,end,elapsed,state,submitline%30" | grep -i 120918
## *MISSING*
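For reference, "finding these outliers" currently means something like the following cross-check of the bulk dump against what squeue can see; a rough sketch only, and the start date and fields are just an example:

# rough sketch: print job IDs that squeue can see but the bulk sacct dump cannot
comm -13 \
    <(sacct -a -n -X -P --starttime=2025-12-12 -o jobidraw | sort -u) \
    <(squeue -h -a -o "%A" | sort -u)

That is exactly the kind of ad-hoc patching I would like to avoid.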
Best regards, Lee
Hello,
I replicated this issue on a different cluster and determined that the root cause is that the time_eligible value in the underlying MySQL database gets set to 0 when a running job is held. Let me demonstrate.
1. Allocate a job and check that I can query it via `sacct -S YYYY-MM-DD`
jess@bcm10-h01:~$ srun --pty bash

jess@bcm10-n01:~$ squeue
JOBID PARTITION  NAME  USER ST  TIME NODES CPUS MIN_M
  114      defq  bash  jess  R  1:13     1    1 2900M

root@bcm10-h01:~# sacct -S 2026-01-06 -a
JobID        JobName    Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
114                bash       defq   allusers          1    RUNNING      0:0
114.0              bash              allusers          1    RUNNING      0:0

root@bcm10-h01:~# scontrol show jobid=114 | grep EligibleTime
   SubmitTime=2026-01-06T14:52:04 *EligibleTime=2026-01-06T14:52:04*
2. Hold and then release the job; confirm that it is no longer queryable via `sacct -S YYYY-MM-DD`, and notice that EligibleTime changes to Unknown.
jess@bcm10-n01:~$ scontrol hold 114
jess@bcm10-n01:~$ scontrol release 114

root@bcm10-h01:~# sacct -S 2026-01-06 -a
JobID        JobName    Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

root@bcm10-h01:~# scontrol show jobid=114 | grep EligibleTime
   SubmitTime=2026-01-06T14:52:04 *EligibleTime=Unknown*
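As an aside, this suggests a database-free way to flag affected jobs: loop over the running jobs and look for EligibleTime=Unknown. A rough, untested sketch:

# rough sketch: print running jobs whose EligibleTime has been wiped
for j in $(squeue -h -a -t R -o "%A"); do
    scontrol show job "$j" | grep -q 'EligibleTime=Unknown' && echo "$j"
done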
3. Check time_eligible in the underlying MySQL database and confirm that changing time_eligible makes it queryable via `sacct -S YYYY-MM-DD`.
root@bcm10-h01:~# mysql --host=localhost --user=slurm --password=XYZ slurm_acct_db

mysql> SELECT id_job FROM slurm_job_table WHERE time_eligible = 0;
+--------+
| id_job |
+--------+
| *114*  |
|    112 |
|    113 |
+--------+
3 rows in set (0.00 sec)

mysql> UPDATE slurm_job_table SET time_eligible = 1767733491 WHERE id_job = 114;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> SELECT time_eligible FROM slurm_job_table WHERE id_job = 114;
+---------------+
| time_eligible |
+---------------+
|    1767733491 |
+---------------+
1 row in set (0.00 sec)

### WORKS AGAIN
root@bcm10-h01:~# sacct -S 2026-01-06 -a
JobID        JobName    Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
114                bash       defq   allusers          1    RUNNING      0:0
114.0              bash              allusers          1    RUNNING      0:0
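Note that the plain time_eligible = 0 filter above can also match jobs that are simply still held while pending; restricting it to rows that have actually started should isolate the jobs that were held while running. Roughly:

# my own refinement, not from the docs: wiped eligible time on jobs that actually started
mysql --host=localhost --user=slurm --password=XYZ slurm_acct_db \
    -e "SELECT id_job, time_submit, time_start FROM slurm_job_table WHERE time_eligible = 0 AND time_start > 0;"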
4. In the man page for sacct, it says things like :
"For example jobs submitted with the "--hold" option will have "EligibleTime=Unknown" as they are pending indefinitely."
*Conclusion*: This very much feels like a *bug*. It doesn't seem like running jobs should be 'holdable' at all, since they cannot be pending indefinitely while they are actively running, and I don't think EligibleTime should change when a user tries to 'hold' a running job either.
*Question*: 1. Identifying these problematic jobs via the underlying MySQL database is not ideal. Are there any better workarounds?
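For reference, the only automated workaround I can think of is a periodic backfill directly in the database, roughly like the sketch below; it assumes time_submit is an acceptable stand-in for the lost eligible time, and I would back up slurm_acct_db first. I'd much rather not touch slurmdbd's tables by hand, hence the question.

# sketch only; assumes time_submit is an acceptable substitute for the lost eligible time
mysql --host=localhost --user=slurm --password=XYZ slurm_acct_db \
    -e "UPDATE slurm_job_table SET time_eligible = time_submit WHERE time_eligible = 0 AND time_start > 0;"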
Best regards, Lee
On Mon, Dec 15, 2025 at 2:33 PM Lee leewithemily@gmail.com wrote:
Hello,

I am using slurm 23.02.6. I have a strange issue. I periodically use sacct to dump job data. I then generate reports based on the resource allocation of our users.

Recently, I noticed some 'missing' jobs in my query. The missing jobs came from a user who had a large array job and then 'held' all of the array jobs, including the ones that were already Running.
...
Hi Lee,
Just my 2 cents: Which database and OS versions do you run?
Furthermore, Slurm 23.02 is really old, so I'd recommend upgrading to 25.05 (or perhaps even 25.11). It just might be that your bug has been resolved in later versions of Slurm or MySQL/MariaDB.
You can find detailed upgrade instructions in [1]. Be especially mindful of the MySQL and slurmdbd upgrades, and perform a dry-run upgrade first on a test node.
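For example, it's worth taking a dump of the accounting database before touching slurmdbd, along these lines:

# example only; adjust credentials and destination for your site
mysqldump -u slurm -p slurm_acct_db > /root/slurm_acct_db_backup.sql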
On 1/7/26 13:22, Lee via slurm-users wrote:
I replicated this issue on a different cluster and determined that the root cause is that the time_eligible in the underlying MySQL database gets set to 0 when a running job is held. Let me demonstrate.
...
I am using slurm 23.02.6. I have a strange issue. I periodically use sacct to dump job data. I then generate reports based on the resource allocation of our users.
IHTH, Ole
[1] https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slur...
Thanks for the suggestion. In my test environment, I'm running:

root@bcm10-h01:~# mysql -V
mysql  Ver 8.0.36-0ubuntu0.22.04.1 for Linux on x86_64 ((Ubuntu))

root@bcm10-h01:~# cat /etc/os-release | grep PRETTY
PRETTY_NAME="Ubuntu 22.04.4 LTS"
This closely matches my production environment.
My production environment is running in an Nvidia POD ecosystem, and I'm using Base Command Manager (v10) to manage my cluster. The version of Slurm in the BCM ISO tends to lag behind by at least 12 months, which is all to say that updating individual cluster components in the Base Command Environment isn't straightforward.
Best, Lee
On Thu, Jan 8, 2026 at 2:30 AM Ole Holm Nielsen via slurm-users <slurm-users@lists.schedmd.com> wrote:
Hi Lee,

Just my 2 cents: Which database and OS versions do you run?

Furthermore, Slurm 23.02 is really old, so I'd recommend upgrading to 25.05 (or perhaps even 25.11).
...
On 12/15/25 11:33 am, Lee via slurm-users wrote:
I am using slurm 23.02.6.
FYI, there are 6 security issues that have been fixed since 23.02.6, and 23.02.7 also had a lot of other fixes in it. The last 23.02 release was 23.02.9:
https://github.com/SchedMD/slurm/blob/master/CHANGELOG/slurm-23.02.md
But that version has long been abandoned; 24.11 is the oldest supported release (and you'd need to upgrade 23.02 to either 23.11 or 24.05 first, as Slurm only supports upgrading from the previous 2 releases at any time).
FWIW we're running 24.11.7 and plan to upgrade directly to 25.11.x early this year if testing goes well.
All the best, Chris