[slurm-users] Upgraded Slurm 17.02 to 19.05, now GRPTRESRunMin limits are applied incorrectly

Mon Dec 16 19:03:44 UTC 2019

Hi Mike,

My showuserlimits tool prints a user's limits from the Slurm database in a readable form:
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits

Maybe this can give you further insight into the source of the problem.
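
As a quick sanity check on the numbers in your message, here is a rough sketch of the arithmetic in shell. The values are copied from your output below; the variable names are only illustrative and are not Slurm settings. It compares the job's own CPU-minutes request against the limit, and it says nothing about what the controller has already counted against the association.

```shell
#!/bin/sh
# Values copied from the quoted message; variable names are illustrative only.
limit_cpu_min=1440000   # GrpTRESRunMin cpu limit, in CPU-minutes
job_cpus=1              # NumCPUs=1 from "scontrol show job"
job_time_min=120        # TimeLimit=02:00:00 expressed in minutes

# Convert the limit to CPU-days and compute the job's CPU-minutes request.
limit_cpu_days=$(( limit_cpu_min / 60 / 24 ))
job_cpu_min=$(( job_cpus * job_time_min ))

echo "limit: ${limit_cpu_days} CPU-days"
echo "job request: ${job_cpu_min} CPU-minutes"

# Compare the single job against the limit, ignoring any other running jobs.
if [ "$job_cpu_min" -le "$limit_cpu_min" ]; then
    echo "job fits under the limit on its own"
else
    echo "job exceeds the limit on its own"
fi
```

Since the request is far below the limit on its own, the interesting question is what usage the controller believes is already charged to the association.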

/Ole

On 16-12-2019 17:27, Renfro, Michael wrote:
> Hey, folks. I’ve just upgraded from Slurm 17.02 (way behind schedule, I know) to 19.05. The only thing I’ve noticed going wrong is that my user resource limits aren’t being applied correctly.
> 
> My typical user has a GrpTRESRunMin limit of cpu=1440000 (1000 CPU-days), and after the upgrade, it appears that limit is blocking jobs even when I’m only requesting a very small amount of resources (2 CPU-hours).
> 
> With no limits, job runs fine:
> 
> =====
> 
> [root@login ~]# squeue -u renfro
>               JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> [root@login ~]# sacctmgr modify user renfro set grptresrunmin=cpu=-1
> 
> [renfro@login ~]$ hpcshell --reservation=slurm-upgrade
> [renfro@gpunode001(job 232393) ~]$ exit
> 
> =====
> 
> With the 1000 CPU-days limit, a 2 CPU-hour job is permanently pending:
> 
> =====
> 
> [root@login ~]# sacctmgr modify user renfro set grptresrunmin=cpu=1440000
> 
> [renfro@login ~]$ hpcshell --reservation=slurm-upgrade
> srun: job 232394 queued and waiting for resources
> 
> [root@login ~]# scontrol show job 232394
> JobId=232394 JobName=bash
>     UserId=renfro(177805483) GroupId=domain users(177800513) MCS_label=N/A
>     Priority=99249 Nice=0 Account=hpcadmins QOS=normal
>     JobState=PENDING Reason=AssocGrpCPURunMinutesLimit Dependency=(null)
>     Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
>     RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
>     SubmitTime=2019-12-16T10:22:38 EligibleTime=2019-12-16T10:22:38
>     AccrueTime=2019-12-16T10:22:38
>     StartTime=Unknown EndTime=Unknown Deadline=N/A
>     SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-12-16T10:22:43
>     Partition=any-interactive AllocNode:Sid=login.hpc.tntech.edu:74850
>     ReqNodeList=(null) ExcNodeList=(null)
>     NodeList=(null)
>     NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>     TRES=cpu=1,mem=2000M,node=1,billing=1
>     Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>     MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
>     Features=(null) DelayBoot=00:00:00
>     Reservation=slurm-upgrade
>     OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>     Command=bash
>     WorkDir=/home/tntech.edu/renfro
>     Power=
> 
> =====
> 
> No other jobs under the hpcadmins account are running or queued. Any ideas on what might be going on? Thanks for any help provided.