[slurm-users] Only 2 jobs will start per GPU node despite 4 GPUs being present

Renfro, Michael Renfro at tntech.edu
Fri Aug 7 18:46:01 UTC 2020


I’ve only got 2 GPUs in my nodes, but I’ve always used non-overlapping CPUs= or COREs= settings. Currently, they’re:

  NodeName=gpunode00[1-4] Name=gpu Type=k80 File=/dev/nvidia[0-1] COREs=0-7,9-15

and I’ve got 2 jobs currently running on each node that’s available.

So maybe:

  NodeName=c0005 Name=gpu File=/dev/nvidia[0-3] CPUs=0-10,11-21,22-32,33-43

would work?
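
Or, spelled out one line per device, in case the combined form above doesn't split the ranges per GPU (an untested sketch using the same ranges):

  NodeName=c0005 Name=gpu File=/dev/nvidia0 CPUs=0-10
  NodeName=c0005 Name=gpu File=/dev/nvidia1 CPUs=11-21
  NodeName=c0005 Name=gpu File=/dev/nvidia2 CPUs=22-32
  NodeName=c0005 Name=gpu File=/dev/nvidia3 CPUs=33-43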

> On Aug 7, 2020, at 12:40 PM, Jodie H. Sprouse <jhs43 at cornell.edu> wrote:
> 
> Hi Tina,
> Thank you so much for looking at this.
> slurm 18.08.8
> 
> nvidia-smi topo -m
>         GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity
> GPU0     X      NV2     NV2     NV2     NODE    0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
> GPU1    NV2      X      NV2     NV2     NODE    0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
> GPU2    NV2     NV2      X      NV2     SYS     1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31,33-33,35-35,37-37,39-39,41-41,43-43
> GPU3    NV2     NV2     NV2      X      SYS     1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31,33-33,35-35,37-37,39-39,41-41,43-43
> mlx5_0  NODE    NODE    SYS     SYS      X
> 
> I have tried the following in gres.conf (without success; only 2 GPU jobs run per node, and no CPU jobs are currently running):
> NodeName=c0005 Name=gpu File=/dev/nvidia0 CPUs=[0,2,4,6,8,10]
> NodeName=c0005 Name=gpu File=/dev/nvidia1 CPUs=[0,2,4,6,8,10]
> NodeName=c0005 Name=gpu File=/dev/nvidia2 CPUs=[1,3,5,7,11,13,15,17,29]
> NodeName=c0005 Name=gpu File=/dev/nvidia3 CPUs=[1,3,5,7,11,13,15,17,29]
> 
> I also tried your suggestions of 0-13, 14-27, and a combination of the two.
> I still only get 2 GPU jobs to run at a time. If I take off the "CPUs=", I do get 4 jobs running per node.
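> 
> For reference, taking the "CPUs=" off leaves the gres.conf as just:
> 
> NodeName=c0005 Name=gpu File=/dev/nvidia0
> NodeName=c0005 Name=gpu File=/dev/nvidia1
> NodeName=c0005 Name=gpu File=/dev/nvidia2
> NodeName=c0005 Name=gpu File=/dev/nvidia3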
> 
> Jodie
> 
> 
> On Aug 7, 2020, at 12:18 PM, Tina Friedrich <tina.friedrich at it.ox.ac.uk> wrote:
> 
> Hi Jodie,
> 
> What version of SLURM are you using? I'm pretty sure newer versions pick the topology up automatically (although I'm on 18.08, so I can't verify that).
> 
> Is what you're wanting to do - basically - forcefully feed a 'wrong' gres.conf to make SLURM assume all GPUs are on one CPU? (I don't think I've ever tried that!).
> 
> I have no idea, unfortunately, what CPU SLURM assigns first - it will not (I don't think) assign cores on the non-GPU CPU first (other people please correct me if I'm wrong!).
> 
> My gres.conf files get written by my config management from the GPU topology; I don't think I've ever written one of them manually. And I've never tried to make them anything wrong, i.e. I've never tried to deliberately give a GPU a CPU mapping that doesn't match the real topology.
> 
> The GRES conf would probably need to look something like
> 
> Name=gpu Type=tesla File=/dev/nvidia0 CPUs=0-13
> Name=gpu Type=tesla File=/dev/nvidia1 CPUs=0-13
> Name=gpu Type=tesla File=/dev/nvidia2 CPUs=0-13
> Name=gpu Type=tesla File=/dev/nvidia3 CPUs=0-13
> 
> or maybe
> 
> Name=gpu Type=tesla File=/dev/nvidia0 CPUs=14-27
> Name=gpu Type=tesla File=/dev/nvidia1 CPUs=14-27
> Name=gpu Type=tesla File=/dev/nvidia2 CPUs=14-27
> Name=gpu Type=tesla File=/dev/nvidia3 CPUs=14-27
> 
> to 'assign' all GPUs to the first 14 CPUs or the second 14 CPUs (your config makes me think there are two 14-core CPUs, so cores 0-13 would probably be CPU1, etc.?)
> 
> (What is the actual topology of the system (according to, say 'nvidia-smi topo -m')?)
> 
> Tina
> 
> On 07/08/2020 16:31, Jodie H. Sprouse wrote:
>> Tina,
>> Thank you. Yes, jobs will run on all 4 gpus if I submit with: --gres-flags=disable-binding
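>> (i.e. submitting with something like: sbatch --gres=gpu:1 --gres-flags=disable-binding <script>)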
>> Yet my goal is to have each GPU bound to a CPU so that a CPU-only job never runs on that particular CPU (keeping it bound to the GPU and always free for a GPU job), and to give CPU jobs the max CPUs minus those 4.
>> 
>> * Hyperthreading is turned on.
>> NodeName=c000[1-5] Gres=gpu:tesla:4 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=190000
>> 
>> PartitionName=gpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 MaxTime=168:00:00 State=UP OverSubscribe=NO TRESBillingWeights="CPU=.25,Mem=0.25G,gres/gpu=2.0"
>> PartitionName=cpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 MaxTime=168:00:00 State=UP OverSubscribe=NO TRESBillingWeights="CPU=.25,Mem=0.25G" MaxCPUsPerNode=48
>> 
>> I have tried variations in gres.conf such as:
>> NodeName=c0005 Name=gpu File=/dev/nvidia[0-1] CPUs=0,2
>> NodeName=c0005 Name=gpu File=/dev/nvidia[2-3] CPUs=1,3
>> 
>> as well as trying COREs= (rather than CPUs=), with no success.
>> 
>> 
>> I’ve battled this all week. Any suggestions would be greatly appreciated!
>> Jodie
>> 
>> 
>> On Aug 7, 2020, at 11:12 AM, Tina Friedrich <tina.friedrich at it.ox.ac.uk> wrote:
>> 
>> Hello,
>> 
>> This is something I've seen once on our systems & it took me a while to figure out what was going on.
>> 
>> It turned out that the system topology was such that all GPUs were connected to one CPU. There were no free cores on that particular CPU, so SLURM did not schedule any more jobs to the GPUs. We needed to disable binding in the job submission to schedule to all of them.
>> 
>> Not sure that applies in your situation (don't know your system), but it's something to check?
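>> 
>> (If it helps: 'nvidia-smi topo -m' on a node will show which CPUs each GPU is attached to, and in our case it was '--gres-flags=disable-binding' on the submission that got jobs onto all of the GPUs, if I remember right.)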
>> 
>> Tina
>> 
>> 
>> On 07/08/2020 15:42, Jodie H. Sprouse wrote:
>>> Good morning.
>>> I am having the same experience here. Wondering if you had a resolution?
>>> Thank you.
>>> Jodie
>>> 
>>> 
>>> On Jun 11, 2020, at 3:27 PM, Rhian Resnick <rresnick at fau.edu> wrote:
>>> 
>>> We have several users submitting single-GPU jobs to our cluster. We expected the jobs to fill each node and fully utilize the available GPUs, but instead we find that only 2 of the 4 GPUs in each node get allocated.
>>> 
>>> If we request 2 GPUs per job and start two jobs, both jobs will start on the same node, fully allocating it. We are puzzled about what is going on, and any hints are welcome.
>>> 
>>> Thanks for your help,
>>> 
>>> Rhian
>>> 
>>> 
>>> 
>>> *Example SBATCH Script*
>>> #!/bin/bash
>>> #SBATCH --job-name=test
>>> #SBATCH --partition=longq7-mri
>>> #SBATCH -N 1
>>> #SBATCH -n 1
>>> #SBATCH --gres=gpu:1
>>> #SBATCH --mail-type=ALL
>>> hostname
>>> echo CUDA_VISIBLE_DEVICES $CUDA_VISIBLE_DEVICES
>>> 
>>> set | grep SLURM
>>> nvidia-smi
>>> sleep 500
>>> 
>>> 
>>> 
>>> 
>>> *gres.conf*
>>> #AutoDetect=nvml
>>> Name=gpu Type=v100  File=/dev/nvidia0 Cores=0
>>> Name=gpu Type=v100  File=/dev/nvidia1 Cores=1
>>> Name=gpu Type=v100  File=/dev/nvidia2 Cores=2
>>> Name=gpu Type=v100  File=/dev/nvidia3 Cores=3
>>> 
>>> 
>>> *slurm.conf*
>>> #
>>> # Example slurm.conf file. Please run configurator.html
>>> # (in doc/html) to build a configuration file customized
>>> # for your environment.
>>> #
>>> #
>>> # slurm.conf file generated by configurator.html.
>>> #
>>> # See the slurm.conf man page for more information.
>>> #
>>> ClusterName=cluster
>>> ControlMachine=cluster-slurm1.example.com
>>> ControlAddr=10.116.0.11
>>> BackupController=cluster-slurm2.example.com
>>> BackupAddr=10.116.0.17
>>> #
>>> SlurmUser=slurm
>>> #SlurmdUser=root
>>> SlurmctldPort=6817
>>> SlurmdPort=6818
>>> SchedulerPort=7321
>>> 
>>> RebootProgram="/usr/sbin/reboot"
>>> 
>>> 
>>> AuthType=auth/munge
>>> #JobCredentialPrivateKey=
>>> #JobCredentialPublicCertificate=
>>> StateSaveLocation=/var/spool/slurm/ctld
>>> SlurmdSpoolDir=/var/spool/slurm/d
>>> SwitchType=switch/none
>>> MpiDefault=none
>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>> SlurmdPidFile=/var/run/slurmd.pid
>>> ProctrackType=proctrack/pgid
>>> 
>>> GresTypes=gpu,mps,bandwidth
>>> 
>>> PrologFlags=x11
>>> #PluginDir=
>>> #FirstJobId=
>>> #MaxJobCount=
>>> #PlugStackConfig=
>>> #PropagatePrioProcess=
>>> #PropagateResourceLimits=
>>> #PropagateResourceLimitsExcept=
>>> #Prolog=
>>> #Epilog=/etc/slurm/slurm.epilog.clean
>>> #SrunProlog=
>>> #SrunEpilog=
>>> #TaskProlog=
>>> #TaskEpilog=
>>> #TaskPlugin=
>>> #TrackWCKey=no
>>> #TreeWidth=50
>>> #TmpFS=
>>> #UsePAM=
>>> #
>>> # TIMERS
>>> SlurmctldTimeout=300
>>> SlurmdTimeout=300
>>> InactiveLimit=0
>>> MinJobAge=300
>>> KillWait=30
>>> Waittime=0
>>> #
>>> # SCHEDULING
>>> SchedulerType=sched/backfill
>>> #bf_interval=10
>>> #SchedulerAuth=
>>> #SelectType=select/linear
>>> # Cores and memory are consumable
>>> #SelectType=select/cons_res
>>> #SelectTypeParameters=CR_Core_Memory
>>> SchedulerParameters=bf_interval=10
>>> SelectType=select/cons_res
>>> SelectTypeParameters=CR_Core
>>> 
>>> FastSchedule=1
>>> #PriorityType=priority/multifactor
>>> #PriorityDecayHalfLife=14-0
>>> #PriorityUsageResetPeriod=14-0
>>> #PriorityWeightFairshare=100000
>>> #PriorityWeightAge=1000
>>> #PriorityWeightPartition=10000
>>> #PriorityWeightJobSize=1000
>>> #PriorityMaxAge=1-0
>>> #
>>> # LOGGING
>>> SlurmctldDebug=3
>>> SlurmctldLogFile=/var/log/slurmctld.log
>>> SlurmdDebug=3
>>> SlurmdLogFile=/var/log/slurmd.log
>>> JobCompType=jobcomp/none
>>> #JobCompLoc=
>>> #
>>> # ACCOUNTING
>>> #JobAcctGatherType=jobacct_gather/linux
>>> #JobAcctGatherFrequency=30
>>> #
>>> #AccountingStorageType=accounting_storage/slurmdbd
>>> #AccountingStorageHost=
>>> #AccountingStorageLoc=
>>> #AccountingStoragePass=
>>> #AccountingStorageUser=
>>> #
>>> #
>>> #
>>> # Default values
>>> # DefMemPerNode=64000
>>> # DefCpuPerGPU=4
>>> # DefMemPerCPU=4000
>>> # DefMemPerGPU=16000
>>> 
>>> 
>>> 
>>> # OpenHPC default configuration
>>> #TaskPlugin=task/affinity
>>> TaskPlugin=task/affinity,task/cgroup
>>> PropagateResourceLimitsExcept=MEMLOCK
>>> TaskPluginParam=autobind=cores
>>> #AccountingStorageType=accounting_storage/mysql
>>> #StorageLoc=slurm_acct_db
>>> 
>>> AccountingStorageType=accounting_storage/slurmdbd
>>> AccountingStorageHost=cluster-slurmdbd1.example.com
>>> #AccountingStorageType=accounting_storage/filetxt
>>> Epilog=/etc/slurm/slurm.epilog.clean
>>> 
>>> 
>>> #PartitionName=normal Nodes=c[1-5] Default=YES MaxTime=24:00:00 State=UP
>>> PartitionName=DEFAULT State=UP Default=NO AllowGroups=ALL Priority=10 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO  Nodes=nodeamd[009-016],c[1-4],nodehtc[001-025]
>>> 
>>> 
>>> # Partitions
>>> 
>>> # Group Limited Queues
>>> 
>>> # OIT DEBUG QUEUE
>>> PartitionName=debug Nodes=c[1-4] MaxTime=24:00:00 State=UP AllowGroups=oit-hpc-admin
>>> 
>>> # RNA CHEM
>>> PartitionName=longq7-rna MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=UNLIMITED Priority=200 Nodes=nodeamd[001-008],nodegpu[021-025] AllowGroups=gpu-rnachem
>>> 
>>> # V100's
>>> PartitionName=longq7-mri MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=200 Nodes=nodenviv100[001-016] AllowGroups=gpu-mri
>>> 
>>> # BIGDATA GRANT
>>> PartitionName=longq-bigdata7 MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=200 Nodes=node[087-098],nodegpu001 AllowGroups=fau-bigdata,nsf-bigdata
>>> 
>>> PartitionName=gpu-bigdata7 Default=NO MinNodes=1 Priority=10  AllowAccounts=ALL  Nodes=nodegpu001 AllowGroups=fau-bigdata,nsf-bigdata
>>> 
>>> # CogNeuroLab
>>> PartitionName=CogNeuroLab Default=NO MinNodes=1 MaxNodes=4 MaxTime=7-12:00:00 AllowGroups=cogneurolab Priority=200 State=UP Nodes=node[001-004]
>>> 
>>> 
>>> # Standard queues
>>> 
>>> # OPEN TO ALL
>>> 
>>> #Short Queue
>>> PartitionName=shortq7 MinNodes=1 MaxNodes=30 DefaultTime=06:00:00 MaxTime=06:00:00 Priority=100 Nodes=nodeamd[001-016],nodenviv100[001-015],nodegpu[001-025],node[001-100],nodehtc[001-025]  Default=YES
>>> 
>>> # Medium Queue
>>> PartitionName=mediumq7 MinNodes=1 MaxNodes=30 DefaultTime=72:00:00 MaxTime=72:00:00 Priority=50 Nodes=nodeamd[009-016],node[004-100]
>>> 
>>> # Long Queue
>>> PartitionName=longq7 MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=30 Nodes=nodeamd[009-016],node[004-100]
>>> 
>>> 
>>> # Interactive
>>> PartitionName=interactive MinNodes=1 MaxNodes=4 DefaultTime=06:00:00 MaxTime=06:00:00 Priority=101 Nodes=node[001-100]  Default=No Hidden=YES
>>> 
>>> # Nodes
>>> 
>>> # Test nodes, (vms)
>>> NodeName=c[1-4] Cpus=4 Feature=virtual RealMemory=16000
>>> 
>>> # AMD Nodes
>>> NodeName=nodeamd[001-016] Procs=64 Boards=1 SocketsPerBoard=8 CoresPerSocket=8 ThreadsPerCore=1 Features=amd,epyc RealMemory=225436
>>> 
>>> # V100 MRI
>>> NodeName=nodenviv100[001-016] CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:v100:4 Feature=v100 RealMemory=192006
>>> 
>>> # GPU nodes
>>> NodeName=nodegpu001 Procs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=gpu:k80:8 Feature=k80,intel RealMemory=64000
>>> NodeName=nodegpu002 Procs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=gpu:gk1:8 Feature=gk1,intel RealMemory=128000
>>> NodeName=nodegpu[003-020] Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 Gres=gpu:gk1:8 Feature=gk1,intel RealMemory=128000
>>> NodeName=nodegpu[021-025] Procs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 Gres=gpu:4 Feature=exxact,intel RealMemory=128000
>>> 
>>> # IvyBridge nodes
>>> NodeName=node[001-021] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,ivybridge RealMemory=112750
>>> # SandyBridge node(2)
>>> NodeName=node022 Procs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 Feature=intel,sandybridge RealMemory=64000
>>> # IvyBridge
>>> NodeName=node[023-050] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,ivybridge RealMemory=112750
>>> # Haswell
>>> NodeName=node[051-100] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,haswell RealMemory=112750
>>> 
>>> 
>>> # Node health monitoring
>>> HealthCheckProgram=/usr/sbin/nhc
>>> HealthCheckInterval=300
>>> ReturnToService=2
>>> 
>>> # Fix for X11 issues
>>> X11Parameters=use_raw_hostname
>>> 
>>> 
>>> 
>>> Rhian Resnick
>>> Associate Director Research Computing
>>> Enterprise Systems
>>> Office of Information Technology
>>> 
>>> Florida Atlantic University
>>> 777 Glades Road, CM22, Rm 173B
>>> Boca Raton, FL 33431
>>> Phone 561.297.2647
>>> Fax 561.297.0222
>>> 
>> 
> 
> 


