[slurm-users] Fwd: Getting information about AssocGrpCPUMinutesLimit for a job
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Sun Aug 11 14:17:31 UTC 2019
Andreas made a good suggestion of looking at the user's TRESRunMin in
sshare in order to answer Jeff's question about AssocGrpCPUMinutesLimit
for a job. However, in practice getting at this information is quite
complicated, and I don't think any ordinary user will bother to look it up.
Due to this complexity, I have added some new functionality to my
"showjob" script available from
The "showjob" tool now tries to extract the information by combining the
sshare, squeue, and sacctmgr commands. Both the AssocGrpCPUMinutesLimit
and AssocGrpCpuLimit job reasons are handled.
An example output for a job is:
$ showjob 1347368
Job 1347368 of user xxx in account yyy has a jobstate=PENDING with
Information about GrpCpuLimit:
User GrpTRES limit is: cpu=1600
Current user TRES is: cpu=1360
This job requires TRES: cpu=960
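The pending state in this output follows from simple arithmetic. A minimal sketch in Python of the comparison behind the GrpCpuLimit reason (the function name is mine; the numbers are the ones from the output above):

```python
# Sketch of the check behind the AssocGrpCpuLimit pending reason:
# a job stays pending while the CPUs it requests, added to the CPUs
# the user's running jobs already hold, exceed the association's
# GrpTRES cpu limit.

def job_fits_cpu_limit(grp_cpu_limit, cpus_in_use, cpus_requested):
    """Return True if the job can start under the GrpTRES cpu limit."""
    return cpus_in_use + cpus_requested <= grp_cpu_limit

# Numbers from the example output above:
# limit cpu=1600, current cpu=1360, job requests cpu=960.
print(job_fits_cpu_limit(1600, 1360, 960))  # 1360 + 960 = 2320 > 1600 -> False
```

So the job pends until enough of the user's 1360 running CPUs free up that 960 more fit under the 1600 limit.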
I think some end users might find this information useful.
Could I ask any interested sites to test the "showjob" tool and see
whether the logic also works in their environment? Please send me
feedback so that I can improve the tool.
On 09-08-2019 08:00, Henkel, Andreas wrote:
>> Users may call sshare -l and have a look at the TRESRunMin. There the
>> number of TRES-minutes allocated by jobs currently running against
>> the account is listed. With a little math (cpu*timelimit) about the job
>> in question the users should be able to figure this out. At least they
>> wouldn't need the debug level increased or a log file.
>> On 8/7/19 8:47 PM, Sarlo, Jeffrey S wrote:
>>> We had a job queued waiting for resources and when we changed the
>>> debug level, we were able to get the following in the slurmctld.log file.
>>> [2019-08-02T10:03:47.347] debug2: JobId=804633 being held, the job is
>>> at or exceeds assoc 50(jeff/(null)/(null)) group max tres(cpu)
>>> minutes of 30000000 of which 1436396 are still available but request
>>> is for 1440000 (plus 0 already in use) tres minutes (request tres
>>> count 80)
>>> We were then able to see that we just needed to lower the timelimit
>>> for the job a little.
>>> Is there a way a user can get this same type of information for a
>>> job, without having to change the slurm debug level and then looking
>>> in a log file?
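The "little math" in this thread can be made concrete with the numbers from the quoted log line. A hedged sketch (the function name is mine, not Slurm's): the requested TRES minutes are cpus times timelimit in minutes, and the job pends while that exceeds what is still available under the group limit.

```python
# Sketch of the arithmetic behind AssocGrpCPUMinutesLimit:
# requested cpu-minutes = cpus * timelimit (in minutes); the job is
# held while that exceeds the cpu-minutes still available to the
# association.

def max_timelimit_minutes(available_tres_minutes, cpus):
    """Largest timelimit (whole minutes) that still fits under the limit."""
    return available_tres_minutes // cpus

# From the log: 1436396 cpu-minutes still available, request tres count 80,
# and a request for 1440000 cpu-minutes, i.e. a timelimit of 18000 minutes.
assert 80 * 18000 == 1440000  # the held request
print(max_timelimit_minutes(1436396, 80))  # 17954
```

Lowering the job's timelimit from 18000 to 17954 minutes (80 * 17954 = 1436320 <= 1436396) is exactly the "lower the timelimit a little" fix Jeff describes.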