[slurm-users] derived counters
juergen.salk at uni-ulm.de
Tue Apr 13 16:29:55 UTC 2021
* Heckes, Frank <heckes at mps.mpg.de> [210413 12:04]:
> This results from a management question: how long do jobs have to
> wait (in s, min, h, days) before they get executed, and how many
> jobs are waiting (queued) in each partition during a certain time
> interval.
> The first one is easy to derive with sacct from the submit and
> start times plus differencing and averaging.
Depending on the definition of "waiting time", the "reserved" field
from sacct may be more appropriate than "start" minus "submit". For
example, for dependency jobs (aka chain jobs) the latter also counts
the time a job had to wait for another job to finish, whereas
"reserved" only starts counting once a job becomes eligible to run.
However, the "eligible" and "reserved" fields in sacct will also be
set or increased if a job has hit a resource throttling limit, which
may be something you want to factor out of the job waiting time.
Unfortunately, I haven't found any metric in sacct that only counts
(or allows one to derive) the time a job had to wait just for
sufficient resources to become available. Maybe someone else has?
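
For the plain "start minus submit" averaging, something along these
lines might serve as a starting point. This is only a rough sketch:
it assumes the SLURM_TIME_FORMAT=%s trick to make sacct print epoch
timestamps (works where strftime supports %s, e.g. glibc), it simply
skips jobs that never started, and the date range is just an example:

  # average queue wait per partition for jobs in the given window
  SLURM_TIME_FORMAT=%s sacct -a -X -n -P \
      -S 2021-04-01 -E 2021-04-08 -o Partition,Submit,Start |
  awk -F'|' '$2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
      sum[$1] += $3 - $2; n[$1]++
  }
  END {
      for (p in sum)
          printf "%-12s avg wait %.0f s over %d jobs\n", p, sum[p]/n[p], n[p]
  }'

The same pipeline would work with the "Reserved" field instead, but
that one is reported as a duration ([DD-]HH:MM:SS) rather than a
timestamp, so the parsing would differ.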
> The second is a bit cumbersome, so I wonder whether a 'solution' is
> already around. The easiest way would have been to monitor from the
> beginning and store the squeue output for later evaluation.
> Unfortunately, I didn't do that.
Not sure if this is a solution for you, but I think you can at least
resample this retrospectively from sacct by using something like:
sacct -a -X -S 2021-04-01T00:00:00 -s PD -o JobID,User,Partition
This will return job records for all jobs that were in pending state
at the specified time.
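
To turn that into a queue-length time series per partition, one
could run the same query in a loop over sample times and count the
records. A minimal sketch (the day and the hourly sampling interval
are just placeholders; each sacct call returns the jobs pending at
exactly that instant, as described above):

  # pending jobs per partition, sampled hourly on 2021-04-01
  for h in $(seq 0 23); do
      t=$(printf '2021-04-01T%02d:00:00' "$h")
      echo "=== $t ==="
      sacct -a -X -n -P -S "$t" -s PD -o Partition | sort | uniq -c
  done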
> > The "slurmacct" command prints (possibly for a specified partition) the
> > average job waiting time while Pending in the queue, but not the queue length
> > information.
> > It may be difficult to answer your question from the Slurm database. The sacct
> > command displays accounting data for all jobs and job steps, but not directly
> > for partitions.
> > There are other Slurm monitoring tools which perhaps can supply the data you
> > are looking for. You could ask this list again.
> > /Ole