[slurm-users] derived counters
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Fri Apr 16 08:53:58 UTC 2021
On 4/13/21 6:29 PM, Juergen Salk wrote:
> * Heckes, Frank <heckes at mps.mpg.de> [210413 12:04]:
>> This results from a management question: how long do jobs have to wait (in s, min, h, days) before they get executed, and
>> how many jobs are waiting (queued) in each partition in a certain time interval?
>> The first one is easy to find with sacct and submit, start counts + difference + averaging.
> Hi Frank,
> depending on the definition of "waiting time", the "reserved" field
> from sacct may be more appropriate than "start" minus "submit". For
> example for dependency jobs (aka chain jobs) the latter does also
> count the time a job had to wait for another job to finish
> whereas "reserved" will only start counting when a job becomes
The slurmacct tool calculates the waiting time as you recommend:
wait = start - eligible
I have seen jobs with eligible == "Unknown", in which case I use the submit
time as the best guess.
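
A minimal Python sketch of that rule, for illustration only. This is not the slurmacct code itself; the function names are made up here, and it assumes the Submit, Eligible and Start fields arrive as ISO timestamps (e.g. 2021-04-13T12:04:00) or as a placeholder such as "Unknown":

```python
from datetime import datetime


def parse_time(field):
    """Return a datetime, or None for placeholders such as "Unknown"."""
    try:
        return datetime.fromisoformat(field)
    except ValueError:
        return None


def wait_seconds(submit, eligible, start):
    """wait = start - eligible, falling back to submit if eligible is unknown."""
    started = parse_time(start)
    if started is None:
        return None  # the job has not started, so no waiting time yet
    reference = parse_time(eligible) or parse_time(submit)
    if reference is None:
        return None  # neither eligible nor submit time is usable
    return (started - reference).total_seconds()
```

Averaging the non-None results per partition then gives the mean waiting time.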
> However, the "eligible" and "reserved" fields in sacct will be
> set or increased also if a job has hit a resource throttling limit,
> which may be something you want to factor out of the job waiting time
> as well.
> Unfortunately, I haven't found any metric in sacct that only
> counts (or allows one to derive) the time a job had to wait just for
> sufficient resources to become available. Maybe someone else?
Good point! I don't have an answer...
>> The second is a bit cumbersome, so I wonder whether a 'solution' is
>> already around. The easiest way is to monitor from the beginning and
>> store the squeue output for later evaluation. Unfortunately I didn’t
>> do that.
> Not sure if this is a solution for you but I think you can at
> least resample this retrospectively from sacct by using something like
> sacct -a -X -S 2021-04-01T00:00:00 -s PD -o JobID,User,Partition
> This will return job records for all jobs that were in pending state
That's a nice trick! According to the sacct man-page, when you specify
a state (-s PD) and a start time with -S but no end time, the DEFAULT
TIME WINDOW rules set endtime=starttime. Thus you get a snapshot of the
Pending jobs at the instant given by -S. This could definitely be used to
make graphs of Pending jobs in each partition as a function of time.
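
A rough Python sketch of such a resampling loop, under some assumptions: the helper names are invented for illustration, and the sacct options -n (no header) and -P (parsable output with "|" separators) are added to the command quoted above to make the output easy to parse. The Partition field is counted verbatim, whatever sacct prints there.

```python
import subprocess
from collections import Counter
from datetime import datetime, timedelta


def snapshot_command(instant):
    """Build the sacct call that lists jobs pending at the given instant."""
    stamp = instant.strftime("%Y-%m-%dT%H:%M:%S")
    return ["sacct", "-a", "-X", "-n", "-P",
            "-S", stamp, "-s", "PD", "-o", "JobID,Partition"]


def count_by_partition(sacct_output):
    """Count job records per partition in 'JobID|Partition' lines."""
    counts = Counter()
    for line in sacct_output.splitlines():
        if not line.strip():
            continue
        _jobid, partition = line.split("|", 1)
        counts[partition] += 1
    return counts


def sample_instants(start, end, step):
    """Yield start, start + step, ... up to and including end."""
    instant = start
    while instant <= end:
        yield instant
        instant += step


def pending_over_time(start, end, step):
    """Yield (instant, Counter) pairs; needs sacct and accounting access."""
    for instant in sample_instants(start, end, step):
        result = subprocess.run(snapshot_command(instant),
                                capture_output=True, text=True, check=True)
        yield instant, count_by_partition(result.stdout)
```

For example, pending_over_time(datetime(2021, 4, 1), datetime(2021, 4, 2), timedelta(hours=1)) would give hourly counts that can be fed to a plotting tool. Each sample is a separate sacct query, so a coarse step is kinder to the database.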