[slurm-users] derived counters

Fri Apr 16 08:53:58 UTC 2021

Hi Jürgen,

On 4/13/21 6:29 PM, Juergen Salk wrote:
> * Heckes, Frank <heckes at mps.mpg.de> [210413 12:04]:
> 
>> This results from a management question: how long do jobs have to wait (in s, min, h, days) before they get executed, and
>> how many jobs are waiting (queued) in each partition in a certain time interval?
>> The first one is easy to find with sacct: take the submit and start times, compute the difference, and average.
> 
> Hi Frank,
> 
> depending on the definition of "waiting time", the "reserved" field
> from sacct may be more appropriate than "start" minus "submit". For
> example for dependency jobs (aka chain jobs) the latter does also
> count the time a job had to wait for another job to finish
> whereas "reserved" will only start counting when a job becomes
> eligible.

The slurmacct tool 
(https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmacct) 
calculates the waiting time as you recommend:

wait = start - eligible

I have seen jobs where eligible is "Unknown", in which case I use the 
submit time as the best guess.
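
As an illustration of that rule (not taken from slurmacct itself), here is 
a minimal Python sketch.  It assumes the Submit, Eligible and Start fields 
arrive as strings in the YYYY-MM-DDTHH:MM:SS form; the function names are 
hypothetical.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout of the sacct Submit/Eligible/Start fields.
SACCT_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_sacct_time(value):
    """Return a datetime, or None for placeholders such as 'Unknown'."""
    try:
        return datetime.strptime(value.strip(), SACCT_TIME_FORMAT)
    except ValueError:
        return None


def wait_time(submit, eligible, start):
    """wait = start - eligible, using submit when eligible is unusable."""
    started = parse_sacct_time(start)
    if started is None:
        return None  # the job has no usable start time (e.g. still pending)
    reference = parse_sacct_time(eligible) or parse_sacct_time(submit)
    if reference is None:
        return None  # neither eligible nor submit could be parsed
    return started - reference
```

With eligible set to "Unknown" the result is simply start minus submit; 
jobs without a usable start time return None so that they can be left out 
of an average.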

> However, the "eligible" and "reserved" fields in sacct will be
> set or increased also if a job has hit a resource throttling limit,
> which may be something you want to factor out of the job waiting time
> as well.
> 
> Unfortunately, I haven't found any metric in sacct that counts only
> (or allows one to derive) the time a job had to wait just for
> sufficient resources to become available. Maybe someone else?

Good point!  I don't have an answer...

>> The second is a bit cumbersome, so I wonder whether a 'solution' is
>> already around. The easiest way is to monitor from the beginning and
>> store the squeue output for later evaluation. Unfortunately I didn’t
>> do that.
> 
> Not sure if this is a solution for you but I think you can at
> least resample this retrospectively from sacct by using something like
> 
>    sacct -a -X -S 2021-04-01T00:00:00 -s PD -o JobID,User,Partition
> 
> This will return job records for all jobs that were in pending state

That's a nice trick!  According to the DEFAULT TIME WINDOW section of the 
sacct man-page, when you specify a state (-s PD) together with a start 
time (-S) but no end time, the end time defaults to the start time.  Thus 
you get a snapshot of the Pending jobs at the instant given by -S.  
Repeating this for a series of -S values could definitely be used to make 
graphs of Pending jobs in each partition as a function of time.
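
A rough Python sketch of how such a time series could be collected.  It 
only builds the sacct argument list for each snapshot and counts the lines 
of already captured output per partition; the helper names, the sampling 
step and the reduced field list (JobID,Partition with -n -P for headerless, 
pipe-separated output) are my own assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta


def snapshot_times(first, last, step):
    """Yield evenly spaced snapshot instants from first to last inclusive."""
    current = first
    while current <= last:
        yield current
        current += step


def sacct_pending_args(when):
    """Build the sacct argument list for a snapshot of pending jobs."""
    stamp = when.strftime("%Y-%m-%dT%H:%M:%S")
    # -n drops the header, -P gives '|'-separated fields without padding.
    return ["sacct", "-a", "-X", "-n", "-P",
            "-S", stamp, "-s", "PD", "-o", "JobID,Partition"]


def pending_per_partition(sacct_output):
    """Count pending jobs per partition from 'JobID|Partition' lines."""
    counts = Counter()
    for line in sacct_output.splitlines():
        fields = line.strip().split("|")
        if len(fields) == 2 and fields[1]:
            counts[fields[1]] += 1
    return counts
```

Each argument list can be run with subprocess.run(args, 
capture_output=True, text=True) and its stdout passed to 
pending_per_partition(); the resulting counts per snapshot time are what 
one would plot.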

/Ole