[slurm-users] Upgraded Slurm 17.02 to 19.05, now GRPTRESRunMin limits are applied incorrectly

Renfro, Michael Renfro at tntech.edu
Mon Dec 16 20:00:26 UTC 2019


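Thanks, Ole. I forgot I had that tool already. I’m still not seeing where the limits are getting enforced, but I’ve now narrowed it down to some of my partitions or to my job routing Lua plugin. A request for the interactive partition stays queued, while the same request for the batch partition starts right away: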

=====

[renfro@login ~]$ hpcshell --reservation=slurm-upgrade --partition=interactive
srun: job 232423 queued and waiting for resources
^Csrun: Job allocation 232423 has been revoked
srun: Force Terminated job 232423
[renfro@login ~]$ hpcshell --reservation=slurm-upgrade --partition=batch
[renfro@node001(job 232424) ~]$ exit
[renfro@login ~]$

=====

=====

JobId=232423 UserId=renfro(177805483) GroupId=domain users(177800513) Name=bash JobState=CANCELLED Partition=any-interactive TimeLimit=120 StartTime=2019-12-16T13:58:59 EndTime=2019-12-16T13:58:59 NodeList=(null) NodeCnt=0 ProcCnt=1 WorkDir=/home/tntech.edu/renfro ReservationName=slurm-upgrade Gres= Account=hpcadmins QOS=normal WcKey= Cluster=its SubmitTime=2019-12-16T13:58:56 EligibleTime=2019-12-16T13:58:56 DerivedExitCode=0:0 ExitCode=0:0
JobId=232424 UserId=renfro(177805483) GroupId=domain users(177800513) Name=bash JobState=COMPLETED Partition=batch TimeLimit=1440 StartTime=2019-12-16T13:59:02 EndTime=2019-12-16T13:59:20 NodeList=node001 NodeCnt=1 ProcCnt=1 WorkDir=/home/tntech.edu/renfro ReservationName=slurm-upgrade Gres= Account=hpcadmins QOS=normal WcKey= Cluster=its SubmitTime=2019-12-16T13:59:02 EligibleTime=2019-12-16T13:59:02 DerivedExitCode=0:0 ExitCode=0:0

=====
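
To make the difference between the two records easier to see, here is a minimal sketch that pulls only the job ID, state, and partition out of lines shaped like the ones above. The sample input is a trimmed copy of those two records, and the function name summarize_records is made up for illustration. Note that job 232423 was submitted with --partition=interactive but is recorded with Partition=any-interactive, which may be the job routing plugin rewriting the request.

```shell
#!/bin/sh
# Sketch: extract JobId, JobState, and Partition from job completion records.
# The sample lines are trimmed copies of the two records shown above; on a
# real system the input would be the job completion log itself.
records='JobId=232423 UserId=renfro(177805483) Name=bash JobState=CANCELLED Partition=any-interactive TimeLimit=120
JobId=232424 UserId=renfro(177805483) Name=bash JobState=COMPLETED Partition=batch TimeLimit=1440'

# summarize_records is a hypothetical helper name, not a Slurm command.
summarize_records() {
    # Fields are whitespace-separated key=value pairs; keep three of them.
    awk '{
        id = state = part = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^JobId=/)     id = substr($i, 7)
            if ($i ~ /^JobState=/)  state = substr($i, 10)
            if ($i ~ /^Partition=/) part = substr($i, 11)
        }
        print id, state, part
    }'
}

printf '%s\n' "$records" | summarize_records
```

With the two sample lines this prints one short line per job, so the partition each job actually landed in can be compared against the one requested on the command line.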

> On Dec 16, 2019, at 1:03 PM, Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk> wrote:
> 
> Hi Mike,
> 
> My showuserlimits tool nicely prints user limits from the Slurm database:
> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits
> 
> Maybe this can give you further insights into the source of problems.
> 
> /Ole
> 
> On 16-12-2019 17:27, Renfro, Michael wrote:
>> Hey, folks. I’ve just upgraded from Slurm 17.02 (way behind schedule, I know) to 19.05. The only thing I’ve noticed going wrong is that my user resource limits aren’t being applied correctly.
>> 
>> My typical user has a GrpTRESRunMin limit of cpu=1440000 (1000 CPU-days), and after the upgrade, it appears that limit is blocking jobs even when I’m only requesting a very small amount of resources (2 CPU-hours).
>> 
>> With no limits, job runs fine:
>> 
>> =====
>> 
>> [root@login ~]# squeue -u renfro
>>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>> [root@login ~]# sacctmgr modify user renfro set grptresrunmin=cpu=-1
>> 
>> [renfro@login ~]$ hpcshell --reservation=slurm-upgrade
>> [renfro@gpunode001(job 232393) ~]$ exit
>> 
>> =====
>> 
>> With the 1000 CPU-days limit, a 2 CPU-hour job is permanently pending:
>> 
>> =====
>> 
>> [root@login ~]# sacctmgr modify user renfro set grptresrunmin=cpu=1440000
>> 
>> [renfro@login ~]$ hpcshell --reservation=slurm-upgrade
>> srun: job 232394 queued and waiting for resources
>> 
>> [root@login ~]# scontrol show job 232394
>> JobId=232394 JobName=bash
>>    UserId=renfro(177805483) GroupId=domain users(177800513) MCS_label=N/A
>>    Priority=99249 Nice=0 Account=hpcadmins QOS=normal
>>    JobState=PENDING Reason=AssocGrpCPURunMinutesLimit Dependency=(null)
>>    Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
>>    RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
>>    SubmitTime=2019-12-16T10:22:38 EligibleTime=2019-12-16T10:22:38
>>    AccrueTime=2019-12-16T10:22:38
>>    StartTime=Unknown EndTime=Unknown Deadline=N/A
>>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-12-16T10:22:43
>>    Partition=any-interactive AllocNode:Sid=login.hpc.tntech.edu:74850
>>    ReqNodeList=(null) ExcNodeList=(null)
>>    NodeList=(null)
>>    NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>    TRES=cpu=1,mem=2000M,node=1,billing=1
>>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>    MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
>>    Features=(null) DelayBoot=00:00:00
>>    Reservation=slurm-upgrade
>>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>>    Command=bash
>>    WorkDir=/home/tntech.edu/renfro
>>    Power=
>> 
>> =====
>> 
>> No other jobs under the hpcadmins account are running or queued. Any ideas on what might be going on? Thanks for any help provided.
> 
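
For reference, the arithmetic behind the numbers quoted above can be checked with a small sketch. It assumes the job's charge against GrpTRESRunMin is its allocated CPUs multiplied by its time limit in minutes; the variable names are only illustrative, and the values are taken from the sacctmgr command and the scontrol output in the quoted message.

```shell
#!/bin/sh
# Sketch: compare the CPU-minutes a job would charge against GrpTRESRunMin.
# Assumption: charge = allocated CPUs * time limit in minutes.
limit_cpu_minutes=1440000      # from grptresrunmin=cpu=1440000
job_cpus=1                     # from NumCPUs=1
job_time_limit_minutes=120     # from TimeLimit=02:00:00

job_cpu_minutes=$((job_cpus * job_time_limit_minutes))
limit_cpu_days=$((limit_cpu_minutes / 60 / 24))

echo "job needs ${job_cpu_minutes} CPU-minutes"
echo "limit is ${limit_cpu_minutes} CPU-minutes (${limit_cpu_days} CPU-days)"

if [ "$job_cpu_minutes" -le "$limit_cpu_minutes" ]; then
    echo "job fits under the limit"
else
    echo "job exceeds the limit"
fi
```

On these numbers the job is far below the limit, so with nothing else running under the account the AssocGrpCPURunMinutesLimit reason does not follow from the request itself. The usage the controller currently holds for each association can be inspected with scontrol show assoc_mgr, which might show whether something else is being counted against the limit.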


