Hi everyone,
I am currently stuck with an sacct issue and would appreciate any help/hints/ideas:
My users can no longer retrieve data for their currently running jobs through sacct. Running sacct -a as root reproduces the issue: it does not show running jobs, although both sacct -j <JobID> and squeue -j <JobID> do. AFAICT, this is not intended behavior (?). Widening the time window with sacct -S ... -E did not help either.
root@slurmmaster:~# sacct -a | grep 154415    # this returns nothing
root@slurmmaster:~# sacct -j 154415
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
154415       allocation               primevo          0    PENDING      0:0
154415.batch      batch               primevo          2    RUNNING      0:0
154415.exte+     extern               primevo          2    RUNNING      0:0
root@slurmmaster:~# squeue -j 154415
 JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
154415  standard genedrop username  R  1:31     1 hpc020
Possibly related: slurmdbd crashed before this behavior started.
We run Ubuntu Server 24.04 LTS with Slurm 24.05.4, using a MariaDB accounting database hosted on the same machine as the Slurm controller.
Does anyone here have any ideas?
Best, Pierre