[slurm-users] How to enable QOS correctly?
Matthew BETTINGER
matthew.bettinger at external.total.com
Tue Mar 5 17:29:19 UTC 2019
So here is our default partition:
PartitionName=BDW
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=nid00[016-063]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=3456 TotalNodes=48 SelectTypeParameters=NONE
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
If we just flip on AccountingStorageEnforce=limits,qos (we tried adding "safe" as well), no jobs can run.
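One thing I can check before flipping it on again is whether users actually have associations in slurmdbd; from the sacctmgr man page, something like this should show it (a sketch, using the user from the job below):

  sacctmgr show assoc where user=j0497482 format=Cluster,Account,User,QOS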
Here is a running job, which shows the default "normal" QOS that was created when Slurm was installed:
JobId=244667 JobName=em25d_SEAM
UserId=j0497482(10214) GroupId=rt3(501) MCS_label=N/A
Priority=1 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:06 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2019-03-05T11:24:41 EligibleTime=2019-03-05T11:24:41
StartTime=2019-03-05T11:24:41 EndTime=2019-03-06T11:24:41 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=KNL AllocNode:Sid=hickory-1:4991
ReqNodeList=(null) ExcNodeList=(null)
NodeList=nid00605
BatchHost=nid00605
NumNodes=1 NumCPUs=256 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=256,mem=96763M,node=1,gres/craynetwork=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=96763M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=craynetwork:1 Reservation=(null)
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=./.prg29913/tmp/DIVAcmdEXEC29913.py None /home/j0497482/bin/em25d_SEAM mode=forward model=mod1.h5 model_H=none i_bwc=0 flist=0.15,0.25,0.5,1.0 verbose=5 sabs=1 rabs=1 acqui_file=acq1 nky=20 ofile=out_forward.nc minOff=2000.0,2000.0,2000.0,2000.0 maxOff=10000.0,10000.0,10000.0,10000.0 NoiseEx=1.0e-14,1.0e-14,1.0e-14,1.0e-14 bedThreshold=2
WorkDir=/data/gpfs/Users/j0497482/data/EM_data/Model46
StdErr=/data/gpfs/Users/j0497482/data/EM_data/Model46/./logs/Job_Standalone_244667.slurm_err
StdIn=/dev/null
StdOut=/data/gpfs/Users/j0497482/data/EM_data/Model46/./logs/Job_Standalone_244667.slurm_log
Power=
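I notice that job shows Account=(null). I assume the accounting side of it could be checked with something like (sketch, assuming slurmdbd has the job):

  sacct -j 244667 --format=JobID,User,Account,QOS,State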
sacctmgr show qos normal

      Name   Priority  GraceTime PreemptMode UsageFactor
    ------ ---------- ---------- ----------- -----------
    normal          0   00:00:00     cluster    1.000000
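If it helps, the limits on that QOS can be listed explicitly with something like this (field names taken from the sacctmgr man page, untested here):

  sacctmgr show qos normal format=Name,Priority,MaxWall,MaxTRESPU,Flags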
On 3/5/19, 10:47 AM, "slurm-users on behalf of Michael Gutteridge" <slurm-users-bounces at lists.schedmd.com on behalf of michael.gutteridge at gmail.com> wrote:
Hi
It might be useful to see the configuration of the partition and how the QOS is set up... but at first blush I suspect you may need to set the OverPartQOS flag (https://slurm.schedmd.com/resource_limits.html)
on the QOS to get the QOS limits to take precedence over the limits in the partition. However, the pending "reason" would be different if that were the case.
Have a look at that, and maybe send the QOS and partition config.
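For what it's worth, I believe setting that flag would look something like this (untested on your cluster, and note that a plain Flags= replaces any existing flags on the QOS):

  sacctmgr modify qos normal set Flags=OverPartQOS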
- Michael
On Tue, Mar 5, 2019 at 7:40 AM Matthew BETTINGER <matthew.bettinger at external.total.com> wrote:
Hey Slurm gurus. We have been trying, off and on, to enable Slurm QOS on a Cray system here for quite a while, but we can never get it working. Every time we try to enable QOS we disrupt the cluster and its users and have to fall back, and I'm not sure what we are doing
wrong. We run a pretty open system here since we are a research group, but there are times when we need to let a user run a job that exceeds a partition limit. In lieu of using QOS, the only other way we have figured out to do this is to create a new partition
and push out the modified slurm.conf. It's a hassle.
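What we are hoping for is roughly this workflow instead (the "longwall" QOS name is made up, and I'm guessing at the sacctmgr syntax from the man page):

  sacctmgr add qos longwall
  sacctmgr modify qos longwall set MaxWall=3-00:00:00
  sacctmgr modify user where name=j0497482 set QOS+=longwall

and then the user submits with sbatch --qos=longwall. Whether that can actually exceed the partition's MaxTime is part of what we can't get working.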
I'm not sure exactly what information is needed to troubleshoot this, but my understanding is that to enable QOS we need this line in slurm.conf:
AccountingStorageEnforce=limits,qos
Every time we attempt this, no one can submit a job; Slurm reports something like "waiting on resources", I believe.
We have accounting enabled, and everyone is a member of the default QOS, "normal".
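Next time we flip it on I will capture the exact pending reason per job, e.g.:

  squeue --format="%.10i %.9P %.8u %.8a %.8q %.20r"

(%a should show the account and %r the reason, per the squeue man page.)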
Configuration data as of 2019-03-05T09:36:19
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost = hickory-1
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,bb/cray,gres/craynetwork,gres/gpu
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType = acct_gather_energy/rapl
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq = 30 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = 1
AuthInfo = (null)
AuthType = auth/munge
BackupAddr = hickory-2
BackupController = hickory-2
BatchStartTimeout = 10 sec
BOOT_TIME = 2019-03-04T16:11:55
BurstBufferType = burst_buffer/cray
CacheGroups = 0
CheckpointType = checkpoint/none
ChosLoc = (null)
ClusterName = hickory
CompleteWait = 0 sec
ControlAddr = hickory-1
ControlMachine = hickory-1
CoreSpecPlugin = cray
CpuFreqDef = Performance
CpuFreqGovernors = Performance,OnDemand
CryptoType = crypto/munge
DebugFlags = (null)
DefMemPerNode = UNLIMITED
DisableRootJobs = No
EioTimeout = 60
EnforcePartLimits = NO
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 1
FastSchedule = 0
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = gpu,craynetwork
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 0 sec
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/linux
JobAcctGatherParams = (null)
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/cncu
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = cray
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 1
KillWait = 30 sec
LaunchParameters = (null)
LaunchType = launch/slurm
Layouts =
Licenses = (null)
LicensesUsed = (null)
MailDomain = (null)
MailProg = /bin/mail
MaxArraySize = 1001
MaxJobCount = 10000
MaxJobId = 67043328
MaxMemPerCPU = 128450
MaxStepCount = 40000
MaxTasksPerNode = 512
MCSPlugin = mcs/none
MCSParameters = (null)
MemLimitEnforce = Yes
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDefault = none
MpiParams = ports=20000-32767
MsgAggregationParams = (null)
NEXT_JOB_ID = 244342
NodeFeaturesPlugins = (null)
OverTimeLimit = 0 min
PluginDir = /opt/slurm/17.02.6/lib64/slurm
PlugStackConfig = /etc/opt/slurm/plugstack.conf
PowerParameters = (null)
PowerPlugin =
PreemptMode = OFF
PreemptType = preempt/none
PriorityParameters = (null)
PriorityDecayHalfLife = 7-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags =
PriorityMaxAge = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 0
PriorityWeightFairShare = 0
PriorityWeightJobSize = 0
PriorityWeightPartition = 0
PriorityWeightQOS = 0
PriorityWeightTRES = (null)
PrivateData = none
ProctrackType = proctrack/cray
Prolog = (null)
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = (null)
PropagateResourceLimitsExcept = AS
RebootProgram = (null)
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 2
RoutePlugin = route/default
SallocDefaultCommand = (null)
SbcastParameters = (null)
SchedulerParameters = (null)
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/cray
SelectTypeParameters = CR_CORE_MEMORY,OTHER_CONS_RES,NHC_ABSOLUTELY_NO
SlurmUser = root(0)
SlurmctldDebug = info
SlurmctldLogFile = /var/spool/slurm/slurmctld.log
SlurmctldPort = 6817
SlurmctldTimeout = 120 sec
SlurmdDebug = info
SlurmdLogFile = /var/spool/slurmd/%h.log
SlurmdPidFile = /var/spool/slurmd/slurmd.pid
SlurmdPlugstack = (null)
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurmd
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/spool/slurm/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /etc/opt/slurm/slurm.conf
SLURM_VERSION = 17.02.6
SrunEpilog = (null)
SrunPortRange = 0-0
SrunProlog = (null)
StateSaveLocation = /apps/cluster/hickory/slurm/
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/cray
TaskEpilog = (null)
TaskPlugin = task/cray,task/affinity,task/cgroup
TaskPluginParam = (null type)
TaskProlog = (null)
TCPTimeout = 2 sec
TmpFS = /tmp
TopologyParam = (null)
TopologyPlugin = topology/none
TrackWCKey = No
TreeWidth = 50
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec
Slurmctld(primary/backup) at hickory-1/hickory-2 are UP/UP