[slurm-users] scheduling issue

Erik Eisold eisold at pks.mpg.de
Fri Aug 14 09:22:24 UTC 2020


Hello all,

we are experiencing an issue on our cluster where entire nodes
sometimes remain idle while jobs that could run on those nodes are
pending in the queue.

Our partition layout is a bit unusual: almost all of our nodes are in
one common partition, a subset of those nodes is also in a second
partition, and this nesting repeats once more. Apart from the nodes
they contain, the only difference between the partitions is the
maximum run time. I originally set it up this way to give users with
shorter jobs a quicker response time and to keep the whole cluster
from being clogged up for days on end with long-running jobs; I was
also new to cluster setup and to Slurm itself at the time. I have
attached a rough visualization of this setup to this mail. There are
two more completely separate partitions that are not in this image.
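
Roughly, and with made-up partition names and node ranges (only the
nesting and the differing MaxTime values reflect our real setup), the
partition definitions look like this:

    # almost all nodes are in "long", a subset of those is also in
    # "medium", and a subset of that again in "short"; only MaxTime differs
    PartitionName=long   Nodes=node[001-100] MaxTime=14-00:00:00
    PartitionName=medium Nodes=node[001-060] MaxTime=2-00:00:00
    PartitionName=short  Nodes=node[001-030] MaxTime=08:00:00 Default=YES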

My idea for a solution would be to move all nodes into one common
partition and use partition QOS to implement the time and resource
restrictions, because I suspect the scheduler is not really meant to
handle the kind of setup we chose in the beginning.
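
To sketch what I have in mind (the QOS names and limits below are
only placeholders, not a tested configuration):

    # one QOS per run-time class, with a cap on how many CPUs the
    # long-running class may occupy at once
    sacctmgr add qos short  MaxWall=08:00:00
    sacctmgr add qos medium MaxWall=2-00:00:00
    sacctmgr add qos long   MaxWall=14-00:00:00 GrpTRES=cpu=1024

    # slurm.conf: a single partition over all nodes, jobs pick a QOS via --qos
    PartitionName=main Nodes=node[001-100] Default=YES AllowQos=short,medium,long

Users would then submit with --qos=short instead of -p short, which
is part of the retraining I would like to avoid if there is a
simpler fix.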

But before committing to such a change I wanted to ask whether
someone has experienced similar problems and has advice for a
redesign, or maybe even a solution that would not require changing
the node topology, so we do not have to reteach our users which
submission options are required.

I have attached the scheduling parameters for our cluster below. Our
Slurm version is 19.05.3-2.

Thanks in advance for any help and/or suggestions you might have.

Kind regards,
Erik Eisold

-------------- next part --------------
Configuration data as of 2020-08-14T11:01:40
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits
AccountingStorageHost   = xxx.xxx.xxx.xxx
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 1200 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = 0
AuthAltTypes            = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec
BOOT_TIME               = 
BurstBufferType         = (null)
CheckpointType          = checkpoint/none
CliFilterPlugins        = (null)
ClusterName             = 
CommunicationParameters = (null)
CompleteWait            = 60 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = Performance,OnDemand,UserSpace
CredType                = cred/munge
DebugFlags              = (null)
DefMemPerCPU            = 500
DisableRootJobs         = Yes
EioTimeout              = 30
EnforcePartLimits       = ALL
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 1
FastSchedule            = 1
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = gpu,scratch
GpuFreqDef              = high,memory=high
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
JobAcctGatherFrequency  = 300
JobAcctGatherType       = jobacct_gather/cgroup
JobAcctGatherParams     = (null)
JobCheckpointDir        = /var/slurm/checkpoint
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 0
KillWait                = 60 sec
LaunchParameters        = (null)
LaunchType              = launch/slurm
Layouts                 = 
Licenses                = (null)
LicensesUsed            = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = pks.mpg.de
MailProg                = /bin/mail
MaxArraySize            = 100001
MaxJobCount             = 1000000
MaxJobId                = 67043328
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 100000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MessageTimeout          = 10 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = (null)
MsgAggregationParams    = (null)
NEXT_JOB_ID             = 53699248
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 5 min
PluginDir               = path
PlugStackConfig         = path
PowerParameters         = (null)
PowerPlugin             = 
PreemptMode             = OFF
PreemptType             = preempt/none
PreemptExemptTime       = 00:00:00
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife   = 12:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = Yes
PriorityFlags           = CALCULATE_RUNNING
PriorityMaxAge          = 2-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 1000
PriorityWeightAssoc     = 0
PriorityWeightFairShare = 10000
PriorityWeightJobSize   = 1000
PriorityWeightPartition = 100
PriorityWeightQOS       = 0
PriorityWeightTRES      = (null)
PrivateData             = accounts,reservations,usage
ProctrackType           = proctrack/cgroup
Prolog                  = path
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = (null)
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = path
ResumeRate              = 60 nodes/min
ResumeTimeout           = 2400 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 1
RoutePlugin             = route/default
SallocDefaultCommand    = (null)
SbcastParameters        = (null)
SchedulerParameters     = bf_continue,default_queue_depth=40000,bf_window=129600,bf_resolution=480,bf_interval=60,bf_max_job_test=100000
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_res
SelectTypeParameters    = CR_CPU_MEMORY
SlurmUser               = slurm(1252)
SlurmctldAddr           = (null)
SlurmctldDebug          = info
SlurmctldHost[0]        = host
SlurmctldLogFile        = (null)
SlurmctldPort           = 6817
SlurmctldSyslogDebug    = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 3600 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = (null)
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurm/d
SlurmdSyslogDebug       = unknown
SlurmdTimeout           = 3600 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = path
SLURM_VERSION           = 19.05.3-2
SrunEpilog              = (null)
SrunPortRange           = 60000-63000
SrunProlog              = (null)
StateSaveLocation       = /var/spool/slurm/ctld
SuspendExcNodes         = nodes
SuspendExcParts         = testing
SuspendProgram          = path
SuspendRate             = 10 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 1200 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/affinity,task/cgroup
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 5 sec
TmpFS                   = /scratch
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 30
UsePam                  = 0
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 120 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec
X11Parameters           = (null)

Cgroup Support Configuration:
AllowedDevicesFile      = path
AllowedKmemSpace        = (null)
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupAutomount         = yes
CgroupMountpoint        = /sys/fs/cgroup
ConstrainCores          = yes
ConstrainDevices        = no
ConstrainKmemSpace      = no
ConstrainRAMSpace       = yes
ConstrainSwapSpace      = yes
MaxKmemPercent          = 100.0%
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = 0
MinKmemSpace            = 30 MB
MinRAMSpace             = 30 MB
TaskAffinity            = no

-------------- next part --------------
A non-text attachment was scrubbed...
Name: node_topology.png
Type: image/png
Size: 16834 bytes
Desc: not available
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20200814/11301b46/attachment-0001.png>

