[slurm-users] Only 2 jobs will start per GPU node despite 4 GPU's being present

Jodie H. Sprouse jhs43 at cornell.edu
Fri Aug 7 17:40:19 UTC 2020


Hi Tina,
Thank you so much for looking at this.
slurm 18.08.8

nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity
GPU0     X      NV2     NV2     NV2     NODE    0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
GPU1    NV2      X      NV2     NV2     NODE    0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
GPU2    NV2     NV2      X      NV2     SYS     1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31,33-33,35-35,37-37,39-39,41-41,43-43
GPU3    NV2     NV2     NV2      X      SYS     1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31,33-33,35-35,37-37,39-39,41-41,43-43
mlx5_0  NODE    NODE    SYS     SYS      X 

I have tried the following in gres.conf (without success; only 2 GPU jobs run per node, and no CPU jobs are currently running):
NodeName=c0005 Name=gpu File=/dev/nvidia0 CPUs=[0,2,4,6,8,10]
NodeName=c0005 Name=gpu File=/dev/nvidia1 CPUs=[0,2,4,6,8,10]
NodeName=c0005 Name=gpu File=/dev/nvidia2 CPUs=[1,3,5,7,11,13,15,17,29]
NodeName=c0005 Name=gpu File=/dev/nvidia3 CPUs=[1,3,5,7,11,13,15,17,29]

I also tried your suggestions of 0-13, 14-27, and a combo.
I still only get 2 jobs to run on gpus at a time. If I take off the “CPUs=“, I do get 4 jobs running per node.
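
In case the square brackets or the interleaved CPU numbering are what's confusing it, the next thing on my list is to bind each pair of GPUs to a full socket's worth of cores rather than a handful of threads, along these lines (untested, and assuming Slurm counts cores socket-major, 0-13 on the first socket and 14-27 on the second, which is not necessarily the even/odd numbering nvidia-smi shows):

NodeName=c0005 Name=gpu File=/dev/nvidia0 Cores=0-13
NodeName=c0005 Name=gpu File=/dev/nvidia1 Cores=0-13
NodeName=c0005 Name=gpu File=/dev/nvidia2 Cores=14-27
NodeName=c0005 Name=gpu File=/dev/nvidia3 Cores=14-27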

Jodie


On Aug 7, 2020, at 12:18 PM, Tina Friedrich <tina.friedrich at it.ox.ac.uk> wrote:

Hi Jodie,

what version of SLURM are you using? I'm pretty sure newer versions pick the topology up automatically (although I'm on 18.08 so I can't verify that).

Is what you're wanting to do - basically - forcefully feed a 'wrong' gres.conf to make SLURM assume all GPUs are on one CPU? (I don't think I've ever tried that!).

I have no idea, unfortunately, what CPU SLURM assigns first - it will not (I don't think) assign cores on the non-GPU CPU first (other people please correct me if I'm wrong!).

My gres.conf files get written by my config management from the GPU topology; I don't think I've ever written one of them manually. And I've never tried to make them deliberately 'wrong', i.e. give a GPU a CPU affinity it doesn't really have.

The GRES conf would probably need to look something like

Name=gpu Type=tesla File=/dev/nvidia0 CPUs=0-13
Name=gpu Type=tesla File=/dev/nvidia1 CPUs=0-13
Name=gpu Type=tesla File=/dev/nvidia2 CPUs=0-13
Name=gpu Type=tesla File=/dev/nvidia3 CPUs=0-13

or maybe

Name=gpu Type=tesla File=/dev/nvidia0 CPUs=14-27
Name=gpu Type=tesla File=/dev/nvidia1 CPUs=14-27
Name=gpu Type=tesla File=/dev/nvidia2 CPUs=14-27
Name=gpu Type=tesla File=/dev/nvidia3 CPUs=14-27

to 'assign' all GPUs to the first 14 CPUs or the second 14 CPUs (your config makes me think there are two 14-core CPUs, so cores 0-13 would probably be CPU1, etc.?)

(What is the actual topology of the system (according to, say 'nvidia-smi topo -m')?)

Tina

On 07/08/2020 16:31, Jodie H. Sprouse wrote:
> Tina,
> Thank you. Yes, jobs will run on all 4 gpus if I submit with: --gres-flags=disable-binding
> Yet my goal is to bind each GPU to a CPU so that a CPU-only job can never run on that particular CPU (keeping it bound to the GPU and always free for a GPU job), and to give CPU jobs the node's maximum CPUs minus those 4.
> 
> * Hyperthreading is turned on.
> NodeName=c000[1-5] Gres=gpu:tesla:4 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=190000
> 
> PartitionName=gpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 MaxTime=168:00:00 State=UP OverSubscribe=NO TRESBillingWeights="CPU=.25,Mem=0.25G,gres/gpu=2.0"
> PartitionName=cpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 MaxTime=168:00:00 State=UP OverSubscribe=NO TRESBillingWeights="CPU=.25,Mem=0.25G" MaxCPUsPerNode=48
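> 
> (That node definition works out to 2 sockets x 14 cores x 2 threads = 56 CPUs per node, so MaxCPUsPerNode=48 on the cpu partition leaves 8 hardware threads, i.e. 4 physical cores with hyperthreading, free for GPU jobs.)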
> 
> I have tried variations for gres.conf such as:
> NodeName=c0005 Name=gpu File=/dev/nvidia[0-1] CPUs=0,2
> NodeName=c0005 Name=gpu File=/dev/nvidia[2-3] CPUs=1,3
> 
> as well as trying Cores= (rather than CPUs=), with no success.
> 
> 
> I've battled this all week; any suggestions would be greatly appreciated!
> Thank you!
> Jodie
> 
> 
> On Aug 7, 2020, at 11:12 AM, Tina Friedrich <tina.friedrich at it.ox.ac.uk> wrote:
> 
> Hello,
> 
> This is something I've seen once on our systems & it took me a while to figure out what was going on.
> 
> The cause was that the system topology was such that all GPUs were connected to one CPU. There were no free cores on that particular CPU, so SLURM did not schedule any more jobs to the GPUs; we needed to disable binding at job submission to schedule to all of them.
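> 
> For us the quick workaround was on the submit side, roughly:
> 
> #SBATCH --gres=gpu:1
> #SBATCH --gres-flags=disable-binding
> 
> at the cost of losing the CPU locality that the binding is meant to enforce.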
> 
> Not sure that applies in your situation (don't know your system), but it's something to check?
> 
> Tina
> 
> 
> On 07/08/2020 15:42, Jodie H. Sprouse wrote:
>> Good morning.
>> I am having the same experience here and am wondering if you found a resolution?
>> Thank you.
>> Jodie
>> 
>> 
>> On Jun 11, 2020, at 3:27 PM, Rhian Resnick <rresnick at fau.edu> wrote:
>> 
>> We have several users submitting single-GPU jobs to our cluster. We expected the jobs to fill each node and fully utilize the available GPUs, but we instead find that only 2 out of the 4 GPUs in each node get allocated.
>> 
>> If we request 2 GPUs per job and start two jobs, both jobs will start on the same node, fully allocating it. We are puzzled about what is going on, and any hints are welcome.
>> 
>> Thanks for your help,
>> 
>> Rhian
>> 
>> 
>> 
>> *Example SBATCH Script*
>> #!/bin/bash
>> #SBATCH --job-name=test
>> #SBATCH --partition=longq7-mri
>> #SBATCH -N 1
>> #SBATCH -n 1
>> #SBATCH --gres=gpu:1
>> #SBATCH --mail-type=ALL
>> hostname
>> echo CUDA_VISIBLE_DEVICES $CUDA_VISIBLE_DEVICES
>> 
>> set | grep SLURM
>> nvidia-smi
>> sleep 500
>> 
>> 
>> 
>> 
>> *gres.conf*
>> #AutoDetect=nvml
>> Name=gpu Type=v100  File=/dev/nvidia0 Cores=0
>> Name=gpu Type=v100  File=/dev/nvidia1 Cores=1
>> Name=gpu Type=v100  File=/dev/nvidia2 Cores=2
>> Name=gpu Type=v100  File=/dev/nvidia3 Cores=3
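>> 
>> If the intent there is socket-level locality rather than pinning each GPU to a single core, it could presumably also be written along these lines (a sketch only, assuming GPUs 0-1 sit on the first socket and GPUs 2-3 on the second, and that Slurm numbers the 2 x 16 cores socket-major as 0-15 and 16-31; 'nvidia-smi topo -m' on a node would confirm the real layout):
>> 
>> Name=gpu Type=v100  File=/dev/nvidia0 Cores=0-15
>> Name=gpu Type=v100  File=/dev/nvidia1 Cores=0-15
>> Name=gpu Type=v100  File=/dev/nvidia2 Cores=16-31
>> Name=gpu Type=v100  File=/dev/nvidia3 Cores=16-31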
>> 
>> 
>> *slurm.conf*
>> #
>> # Example slurm.conf file. Please run configurator.html
>> # (in doc/html) to build a configuration file customized
>> # for your environment.
>> #
>> #
>> # slurm.conf file generated by configurator.html.
>> #
>> # See the slurm.conf man page for more information.
>> #
>> ClusterName=cluster
>> ControlMachine=cluster-slurm1.example.com
>> ControlAddr=10.116.0.11
>> BackupController=cluster-slurm2.example.com
>> BackupAddr=10.116.0.17
>> #
>> SlurmUser=slurm
>> #SlurmdUser=root
>> SlurmctldPort=6817
>> SlurmdPort=6818
>> SchedulerPort=7321
>> 
>> RebootProgram="/usr/sbin/reboot"
>> 
>> 
>> AuthType=auth/munge
>> #JobCredentialPrivateKey=
>> #JobCredentialPublicCertificate=
>> StateSaveLocation=/var/spool/slurm/ctld
>> SlurmdSpoolDir=/var/spool/slurm/d
>> SwitchType=switch/none
>> MpiDefault=none
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmdPidFile=/var/run/slurmd.pid
>> ProctrackType=proctrack/pgid
>> 
>> GresTypes=gpu,mps,bandwidth
>> 
>> PrologFlags=x11
>> #PluginDir=
>> #FirstJobId=
>> #MaxJobCount=
>> #PlugStackConfig=
>> #PropagatePrioProcess=
>> #PropagateResourceLimits=
>> #PropagateResourceLimitsExcept=
>> #Prolog=
>> #Epilog=/etc/slurm/slurm.epilog.clean
>> #SrunProlog=
>> #SrunEpilog=
>> #TaskProlog=
>> #TaskEpilog=
>> #TaskPlugin=
>> #TrackWCKey=no
>> #TreeWidth=50
>> #TmpFS=
>> #UsePAM=
>> #
>> # TIMERS
>> SlurmctldTimeout=300
>> SlurmdTimeout=300
>> InactiveLimit=0
>> MinJobAge=300
>> KillWait=30
>> Waittime=0
>> #
>> # SCHEDULING
>> SchedulerType=sched/backfill
>> #bf_interval=10
>> #SchedulerAuth=
>> #SelectType=select/linear
>> # Cores and memory are consumable
>> #SelectType=select/cons_res
>> #SelectTypeParameters=CR_Core_Memory
>> SchedulerParameters=bf_interval=10
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_Core
>> 
>> FastSchedule=1
>> #PriorityType=priority/multifactor
>> #PriorityDecayHalfLife=14-0
>> #PriorityUsageResetPeriod=14-0
>> #PriorityWeightFairshare=100000
>> #PriorityWeightAge=1000
>> #PriorityWeightPartition=10000
>> #PriorityWeightJobSize=1000
>> #PriorityMaxAge=1-0
>> #
>> # LOGGING
>> SlurmctldDebug=3
>> SlurmctldLogFile=/var/log/slurmctld.log
>> SlurmdDebug=3
>> SlurmdLogFile=/var/log/slurmd.log
>> JobCompType=jobcomp/none
>> #JobCompLoc=
>> #
>> # ACCOUNTING
>> #JobAcctGatherType=jobacct_gather/linux
>> #JobAcctGatherFrequency=30
>> #
>> #AccountingStorageType=accounting_storage/slurmdbd
>> #AccountingStorageHost=
>> #AccountingStorageLoc=
>> #AccountingStoragePass=
>> #AccountingStorageUser=
>> #
>> #
>> #
>> # Default values
>> # DefMemPerNode=64000
>> # DefCpuPerGPU=4
>> # DefMemPerCPU=4000
>> # DefMemPerGPU=16000
>> 
>> 
>> 
>> # OpenHPC default configuration
>> #TaskPlugin=task/affinity
>> TaskPlugin=task/affinity,task/cgroup
>> PropagateResourceLimitsExcept=MEMLOCK
>> TaskPluginParam=autobind=cores
>> #AccountingStorageType=accounting_storage/mysql
>> #StorageLoc=slurm_acct_db
>> 
>> AccountingStorageType=accounting_storage/slurmdbd
>> AccountingStorageHost=cluster-slurmdbd1.example.com
>> #AccountingStorageType=accounting_storage/filetxt
>> Epilog=/etc/slurm/slurm.epilog.clean
>> 
>> 
>> #PartitionName=normal Nodes=c[1-5] Default=YES MaxTime=24:00:00 State=UP
>> PartitionName=DEFAULT State=UP Default=NO AllowGroups=ALL Priority=10 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO  Nodes=nodeamd[009-016],c[1-4],nodehtc[001-025]
>> 
>> 
>> # Partitions
>> 
>> # Group Limited Queues
>> 
>> # OIT DEBUG QUEUE
>> PartitionName=debug Nodes=c[1-4] MaxTime=24:00:00 State=UP AllowGroups=oit-hpc-admin
>> 
>> # RNA CHEM
>> PartitionName=longq7-rna MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=UNLIMITED Priority=200 Nodes=nodeamd[001-008],nodegpu[021-025] AllowGroups=gpu-rnachem
>> 
>> # V100's
>> PartitionName=longq7-mri MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=200 Nodes=nodenviv100[001-016] AllowGroups=gpu-mri
>> 
>> # BIGDATA GRANT
>> PartitionName=longq-bigdata7 MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=200 Nodes=node[087-098],nodegpu001 AllowGroups=fau-bigdata,nsf-bigdata
>> 
>> PartitionName=gpu-bigdata7 Default=NO MinNodes=1 Priority=10  AllowAccounts=ALL  Nodes=nodegpu001 AllowGroups=fau-bigdata,nsf-bigdata
>> 
>> # CogNeuroLab
>> PartitionName=CogNeuroLab Default=NO MinNodes=1 MaxNodes=4 MaxTime=7-12:00:00 AllowGroups=cogneurolab Priority=200 State=UP Nodes=node[001-004]
>> 
>> 
>> # Standard queues
>> 
>> # OPEN TO ALL
>> 
>> #Short Queue
>> PartitionName=shortq7 MinNodes=1 MaxNodes=30 DefaultTime=06:00:00 MaxTime=06:00:00 Priority=100 Nodes=nodeamd[001-016],nodenviv100[001-015],nodegpu[001-025],node[001-100],nodehtc[001-025]  Default=YES
>> 
>> # Medium Queue
>> PartitionName=mediumq7 MinNodes=1 MaxNodes=30 DefaultTime=72:00:00 MaxTime=72:00:00 Priority=50 Nodes=nodeamd[009-016],node[004-100]
>> 
>> # Long Queue
>> PartitionName=longq7 MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=30 Nodes=nodeamd[009-016],node[004-100]
>> 
>> 
>> # Interactive
>> PartitionName=interactive MinNodes=1 MaxNodes=4 DefaultTime=06:00:00 MaxTime=06:00:00 Priority=101 Nodes=node[001-100]  Default=No Hidden=YES
>> 
>> # Nodes
>> 
>> # Test nodes, (vms)
>> NodeName=c[1-4] Cpus=4 Feature=virtual RealMemory=16000
>> 
>> # AMD Nodes
>> NodeName=nodeamd[001-016] Procs=64 Boards=1 SocketsPerBoard=8 CoresPerSocket=8 ThreadsPerCore=1 Features=amd,epyc RealMemory=225436
>> 
>> # V100 MRI
>> NodeName=nodenviv100[001-016] CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:v100:4 Feature=v100 RealMemory=192006
>> 
>> # GPU nodes
>> NodeName=nodegpu001 Procs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=gpu:k80:8 Feature=k80,intel RealMemory=64000
>> NodeName=nodegpu002 Procs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=gpu:gk1:8 Feature=gk1,intel RealMemory=128000
>> NodeName=nodegpu[003-020] Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 Gres=gpu:gk1:8 Feature=gk1,intel RealMemory=128000
>> NodeName=nodegpu[021-025] Procs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 Gres=gpu:4 Feature=exxact,intel RealMemory=128000
>> 
>> # IvyBridge nodes
>> NodeName=node[001-021] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,ivybridge RealMemory=112750
>> # SandyBridge node(2)
>> NodeName=node022 Procs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 Feature=intel,sandybridge RealMemory=64000
>> # IvyBridge
>> NodeName=node[023-050] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,ivybridge RealMemory=112750
>> # Haswell
>> NodeName=node[051-100] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,haswell RealMemory=112750
>> 
>> 
>> # Node health monitoring
>> HealthCheckProgram=/usr/sbin/nhc
>> HealthCheckInterval=300
>> ReturnToService=2
>> 
>> # Fix for X11 issues
>> X11Parameters=use_raw_hostname
>> 
>> 
>> 
>> Rhian Resnick
>> Associate Director Research Computing
>> Enterprise Systems
>> Office of Information Technology
>> 
>> Florida Atlantic University
>> 777 Glades Road, CM22, Rm 173B
>> Boca Raton, FL 33431
>> Phone 561.297.2647
>> Fax 561.297.0222
>> 
> 



