[slurm-users] DenyOnLimit flag ignored for QOS, always rejects?

Renfro, Michael Renfro at tntech.edu
Fri Jan 25 16:35:15 UTC 2019


Hey, folks. Running Slurm 17.02.10 with Bright Cluster Manager 8.0.

I wanted to limit queue-stuffing on my GPU nodes, similar to what AssocGrpCPURunMinutesLimit does. The current goal is to restrict a user to 8 active or queued jobs in the production GPU partition, and have any further jobs pend (not be rejected) so other users get fair access to the queue. I'd be fine with a time limit instead of a job-count limit, too.
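
If a time limit turns out to be the better fit, I assume it would look roughly like the association-level setting behind AssocGrpCPURunMinutesLimit, something like the following (user name and minutes value below are just placeholders):

    # placeholder user and cpu-minutes value, just to illustrate the idea
    $ sacctmgr modify user where name=$USER set GrpTRESRunMins=cpu=2880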

I'd assumed a partition QOS was the way to go, as the sacctmgr man page reads in part:

    Flags  Used by the slurmctld to override or enforce certain characteristics.
           Valid options are

           DenyOnLimit
             If set, jobs using this QOS will be rejected at submission time if they do not conform to the QOS 'Max' limits. Group limits will also be treated like 'Max' limits as well and will be denied if they go over. By default jobs that go over these limits will pend until they conform. This currently only applies to QOS and Association limits.

So if I avoid setting the DenyOnLimit flag, extra jobs should pend until they conform, right? My QOS settings for 8 active or pending GPU jobs per user are as follows:

    $ sacctmgr list qos normal,gpu format=name,priority,gracetime,preemptmode,usagefactor,grptresrunmin,MaxSubmitJobsPerUser,flags
          Name   Priority  GraceTime PreemptMode UsageFactor GrpTRESRunMin MaxSubmitPU                Flags
    ---------- ---------- ---------- ----------- ----------- ------------- ----------- --------------------
        normal          0   00:00:00     cluster    1.000000
           gpu          0   00:00:00     cluster    1.000000                         8
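
For reference, a QOS configured like this would have been created with something like the following (approximate, not necessarily my exact commands; the key points are MaxSubmitJobsPerUser=8 and no DenyOnLimit flag):

    $ sacctmgr add qos gpu
    $ sacctmgr modify qos gpu set MaxSubmitJobsPerUser=8
    # Setting the flag (which I have not done) would look like:
    # $ sacctmgr modify qos gpu set Flags=DenyOnLimit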

Partition settings, where the gpu QOS is applied to jobs in the gpu partition:

    $ egrep 'PartitionName=(batch|gpu) ' /etc/slurm/slurm.conf
    PartitionName=batch Default=YES MinNodes=1 MaxNodes=40 DefaultTime=1-00:00:00 MaxTime=30-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=node[001-040]
    PartitionName=gpu Default=NO MinNodes=1 DefaultTime=1-00:00:00 MaxTime=30-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO MaxCPUsPerNode=16 QoS=gpu ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
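
For completeness, the partition-to-QOS mapping can also be checked at runtime, which should just echo the same QoS=gpu setting as slurm.conf:

    $ scontrol show partition gpu | grep -i qos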

The original submission specifies CPUs, time, GRES, QOS, and partition. Jobs 1-8 are accepted, and job 9 is rejected even though I haven't set the DenyOnLimit flag:

    $ for n in $(seq 9); do sbatch --nodes=1 --cpus-per-task=1 --time=00:10:00 --gres=gpu --qos=gpu --partition=gpu omp_hw.sh; done
    Submitted batch job 150548
    Submitted batch job 150549
    Submitted batch job 150550
    Submitted batch job 150551
    Submitted batch job 150552
    Submitted batch job 150553
    Submitted batch job 150554
    Submitted batch job 150555
    sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
    $ scancel -u $USER -p gpu

Minimized to just CPUs, time, and partition, the results are the same, since the gpu QOS is applied automatically to jobs in the gpu partition:

    $ for n in $(seq 9); do sbatch --nodes=1 --cpus-per-task=1 --time=00:10:00 --partition=gpu omp_hw.sh; done
    Submitted batch job 150556
    Submitted batch job 150557
    Submitted batch job 150558
    Submitted batch job 150559
    Submitted batch job 150560
    Submitted batch job 150561
    Submitted batch job 150562
    Submitted batch job 150563
    sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
    $ scancel -u $USER -p gpu

Running in the batch partition with the normal QOS, all 9 jobs are accepted:

    $ for n in $(seq 9); do sbatch --nodes=1 --cpus-per-task=1 --time=00:10:00 --partition=batch omp_hw.sh; done
    Submitted batch job 150564
    Submitted batch job 150565
    Submitted batch job 150566
    Submitted batch job 150567
    Submitted batch job 150568
    Submitted batch job 150569
    Submitted batch job 150570
    Submitted batch job 150571
    Submitted batch job 150572
    $ scancel -u $USER -p batch

Running in the batch partition with the gpu QOS explicitly specified, jobs 1-8 are accepted and job 9 is rejected:

    $ for n in $(seq 9); do sbatch --nodes=1 --cpus-per-task=1 --time=00:10:00 --partition=batch --qos=gpu omp_hw.sh; done
    Submitted batch job 150573
    Submitted batch job 150574
    Submitted batch job 150575
    Submitted batch job 150576
    Submitted batch job 150577
    Submitted batch job 150578
    Submitted batch job 150579
    Submitted batch job 150580
    sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
    $ scancel -u $USER -p batch

So the rejection appears to be triggered by the gpu QOS itself, not by the gpu partition. What might I have missed?

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601     / Tennessee Tech University



