[slurm-users] Only 2 jobs will start per GPU node despite 4 GPUs being present

Jodie H. Sprouse jhs43 at cornell.edu
Wed Aug 12 21:06:35 UTC 2020


Hello Tina,
Thank you for the suggestions and responses!!!
As of right now, it seems to be working after taking the "CPUs=" off altogether in gres.conf. The original thought was to set four cores aside that would always go to the GPUs; that may not be necessary as long as the CPU partition can never grab more than 48. I have set MaxCPUsPerNode=48 for the cpu partition and MaxCPUsPerNode=8 for the gpu partition.
More users will be coming on in the upcoming weeks; I will keep watch. Now onward to making sure TRESBillingWeights="CPU=.25,Mem=0.25G,gres/gpu=1.0" is set correctly so we do not see jobs starved out.
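For reference, this is roughly what the two partition lines look like now (a sketch based on the partition definitions quoted further down in this thread, with the MaxCPUsPerNode limits above folded in):

# gpu partition: limited to 8 logical CPUs (4 cores with hyperthreading) per node
PartitionName=gpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 MaxTime=168:00:00 State=UP OverSubscribe=NO MaxCPUsPerNode=8 TRESBillingWeights="CPU=.25,Mem=0.25G,gres/gpu=1.0"
# cpu partition: capped at 48 logical CPUs so GPU jobs always have cores left
PartitionName=cpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 MaxTime=168:00:00 State=UP OverSubscribe=NO MaxCPUsPerNode=48 TRESBillingWeights="CPU=.25,Mem=0.25G"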
Thank you again!
Jodie


On Aug 10, 2020, at 10:31 AM, Tina Friedrich <tina.friedrich at it.ox.ac.uk> wrote:

Hello,

yes, that would probably work; or simply taking the "CPUs=" off, really.

However, I think what Jodie's trying to do is force all GPU jobs onto one of the CPUs, rather than allowing GPU jobs to spread over all processors regardless of affinity.

Jodie - can you try if

NodeName=c0005 Name=gpu File=/dev/nvidia[0-3] CPUs=0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42

gets you there?

Tina

On 07/08/2020 19:46, Renfro, Michael wrote:
> I’ve only got 2 GPUs in my nodes, but I’ve always used non-overlapping CPUs= or COREs= settings. Currently, they’re:
> 
>   NodeName=gpunode00[1-4] Name=gpu Type=k80 File=/dev/nvidia[0-1] COREs=0-7,9-15
> 
> and I’ve got 2 jobs currently running on each node that’s available.
> 
> So maybe:
> 
>   NodeName=c0005 Name=gpu File=/dev/nvidia[0-3] CPUs=0-10,11-21,22-32,33-43
> 
> would work?
> 
>> On Aug 7, 2020, at 12:40 PM, Jodie H. Sprouse <jhs43 at cornell.edu> wrote:
>> 
>> Hi Tina,
>> Thank you so much for looking at this.
>> slurm 18.08.8
>> 
>> nvidia-smi topo -m
>>         GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity
>> GPU0     X      NV2     NV2     NV2     NODE    0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
>> GPU1    NV2      X      NV2     NV2     NODE    0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
>> GPU2    NV2     NV2      X      NV2     SYS     1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31,33-33,35-35,37-37,39-39,41-41,43-43
>> GPU3    NV2     NV2     NV2      X      SYS     1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31,33-33,35-35,37-37,39-39,41-41,43-43
>> mlx5_0  NODE    NODE    SYS     SYS      X
>> 
>> I have tried the following in gres.conf (without success; only 2 GPU jobs run per node, and no CPU jobs are currently running):
>> NodeName=c0005 Name=gpu File=/dev/nvidia0 CPUs=[0,2,4,6,8,10]
>> NodeName=c0005 Name=gpu File=/dev/nvidia1 CPUs=[0,2,4,6,8,10]
>> NodeName=c0005 Name=gpu File=/dev/nvidia2 CPUs=[1,3,5,7,11,13,15,17,29]
>> NodeName=c0005 Name=gpu File=/dev/nvidia3 CPUs=[1,3,5,7,11,13,15,17,29]
>> 
>> I also tried your suggestions of 0-13, 14-27, and a combination.
>> I still only get 2 jobs to run on GPUs at a time. If I take the "CPUs=" off, I do get 4 jobs running per node.
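>> (In case it helps: while the third job is pending, something like `scontrol -d show node c0005` should show the CPUAlloc count and a GresUsed=gpu line, i.e. how many CPUs and which GPU indices Slurm thinks are in use.)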
>> 
>> Jodie
>> 
>> 
>> On Aug 7, 2020, at 12:18 PM, Tina Friedrich <tina.friedrich at it.ox.ac.uk> wrote:
>> 
>> Hi Jodie,
>> 
>> what version of SLURM are you using? I'm pretty sure newer versions pick the topology up automatically (although I'm on 18.08 so I can't verify that).
>> 
>> Is what you're trying to do, basically, to deliberately feed a 'wrong' gres.conf to make SLURM assume all GPUs are on one CPU? (I don't think I've ever tried that!)
>> 
>> I have no idea, unfortunately, what CPU SLURM assigns first - it will not (I don't think) assign cores on the non-GPU CPU first (other people please correct me if I'm wrong!).
>> 
>> My gres.conf files get written by my config management from the GPU topology; I don't think I've ever written one of them manually. And I've never tried to make them anything 'wrong', i.e. I've never deliberately given a GPU a CPU affinity other than the one it actually has.
>> 
>> The GRES conf would probably need to look something like
>> 
>> Name=gpu Type=tesla File=/dev/nvidia0 CPUs=0-13
>> Name=gpu Type=tesla File=/dev/nvidia1 CPUs=0-13
>> Name=gpu Type=tesla File=/dev/nvidia2 CPUs=0-13
>> Name=gpu Type=tesla File=/dev/nvidia3 CPUs=0-13
>> 
>> or maybe
>> 
>> Name=gpu Type=tesla File=/dev/nvidia0 CPUs=14-27
>> Name=gpu Type=tesla File=/dev/nvidia1 CPUs=14-27
>> Name=gpu Type=tesla File=/dev/nvidia2 CPUs=14-27
>> Name=gpu Type=tesla File=/dev/nvidia3 CPUs=14-27
>> 
>> to 'assign' all GPUs to the first 14 CPUs or the second 14 CPUs (your config makes me think there are two 14-core CPUs, so cores 0-13 would probably be CPU 1, etc.?)
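>> (Something like `lscpu -e=CPU,CORE,SOCKET` on the node should confirm which logical CPU numbers actually sit on which socket; with hyperthreading on, the numbering is often interleaved rather than the first 14 all being one socket.)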
>> 
>> (What is the actual topology of the system (according to, say 'nvidia-smi topo -m')?)
>> 
>> Tina
>> 
>> On 07/08/2020 16:31, Jodie H. Sprouse wrote:
>>> Tina,
>>> Thank you. Yes, jobs will run on all 4 GPUs if I submit with: --gres-flags=disable-binding
>>> Yet my goal is to bind each GPU to a CPU so that a CPU-only job can never run on that particular CPU (it stays bound to the GPU and is always free for a GPU job), and to give CPU jobs at most the total CPUs minus those 4.
>>> 
>>> * Hyperthreading is turned on.
>>> NodeName=c000[1-5] Gres=gpu:tesla:4 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=190000
>>> 
>>> PartitionName=gpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 MaxTime=168:00:00 State=UP OverSubscribe=NO TRESBillingWeights="CPU=.25,Mem=0.25G,gres/gpu=2.0"
>>> PartitionName=cpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 MaxTime=168:00:00 State=UP OverSubscribe=NO TRESBillingWeights="CPU=.25,Mem=0.25G" MaxCPUsPerNode=48
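>>> (For the arithmetic: that node has 2 sockets x 14 cores x 2 threads = 56 logical CPUs, so MaxCPUsPerNode=48 on the cpu partition leaves 8 threads, i.e. 4 physical cores, free for the gpu partition.)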
>>> 
>>> I have tried variations in gres.conf such as:
>>> NodeName=c0005 Name=gpu File=/dev/nvidia[0-1] CPUs=0,2
>>> NodeName=c0005 Name=gpu File=/dev/nvidia[2-3] CPUs=1,3
>>> 
>>> as well as trying Cores= (rather than CPUs=), with no success.
>>> 
>>> 
>>> I’ve battled this all week. Any suggestions would be greatly appreciated!
>>> Jodie
>>> 
>>> 
>>> On Aug 7, 2020, at 11:12 AM, Tina Friedrich <tina.friedrich at it.ox.ac.uk> wrote:
>>> 
>>> Hello,
>>> 
>>> This is something I've seen once on our systems & it took me a while to figure out what was going on.
>>> 
>>> The cause turned out to be that the system topology was such that all GPUs were connected to one CPU. Once there were no free cores on that particular CPU, SLURM did not schedule any more jobs to the GPUs. We needed to disable binding at job submission to schedule to all of them.
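>>> In practice that meant submitting with binding disabled, e.g. something along the lines of:
>>>
>>> sbatch --gres=gpu:1 --gres-flags=disable-binding job.sh
>>>
>>> (job.sh here is just a placeholder for the actual batch script.)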
>>> 
>>> Not sure that applies in your situation (don't know your system), but it's something to check?
>>> 
>>> Tina
>>> 
>>> 
>>> On 07/08/2020 15:42, Jodie H. Sprouse wrote:
>>>> Good morning.
>>>> I am having the same experience here. Wondering if you found a resolution?
>>>> Thank you.
>>>> Jodie
>>>> 
>>>> 
>>>> On Jun 11, 2020, at 3:27 PM, Rhian Resnick <rresnick at fau.edu> wrote:
>>>> 
>>>> We have several users submitting single-GPU jobs to our cluster. We expected the jobs to fill each node and fully utilize the available GPUs, but we instead find that only 2 out of the 4 GPUs in each node get allocated.
>>>> 
>>>> If we request 2 GPUs per job and start two jobs, both jobs will start on the same node, fully allocating the node. We are puzzled about what is going on, and any hints are welcome.
>>>> 
>>>> Thanks for your help,
>>>> 
>>>> Rhian
>>>> 
>>>> 
>>>> 
>>>> *Example SBATCH Script*
>>>> #!/bin/bash
>>>> #SBATCH --job-name=test
>>>> #SBATCH --partition=longq7-mri
>>>> #SBATCH -N 1
>>>> #SBATCH -n 1
>>>> #SBATCH --gres=gpu:1
>>>> #SBATCH --mail-type=ALL
>>>> hostname
>>>> echo CUDA_VISIBLE_DEVICES $CUDA_VISIBLE_DEVICES
>>>> 
>>>> set | grep SLURM
>>>> nvidia-smi
>>>> sleep 500
>>>> 
>>>> 
>>>> 
>>>> 
>>>> *gres.conf*
>>>> #AutoDetect=nvml
>>>> Name=gpu Type=v100  File=/dev/nvidia0 Cores=0
>>>> Name=gpu Type=v100  File=/dev/nvidia1 Cores=1
>>>> Name=gpu Type=v100  File=/dev/nvidia2 Cores=2
>>>> Name=gpu Type=v100  File=/dev/nvidia3 Cores=3
>>>> 
>>>> 
>>>> *slurm.conf*
>>>> #
>>>> # Example slurm.conf file. Please run configurator.html
>>>> # (in doc/html) to build a configuration file customized
>>>> # for your environment.
>>>> #
>>>> #
>>>> # slurm.conf file generated by configurator.html.
>>>> #
>>>> # See the slurm.conf man page for more information.
>>>> #
>>>> ClusterName=cluster
>>>> ControlMachine=cluster-slurm1.example.com
>>>> ControlAddr=10.116.0.11
>>>> BackupController=cluster-slurm2.example.com
>>>> BackupAddr=10.116.0.17
>>>> #
>>>> SlurmUser=slurm
>>>> #SlurmdUser=root
>>>> SlurmctldPort=6817
>>>> SlurmdPort=6818
>>>> SchedulerPort=7321
>>>> 
>>>> RebootProgram="/usr/sbin/reboot"
>>>> 
>>>> 
>>>> AuthType=auth/munge
>>>> #JobCredentialPrivateKey=
>>>> #JobCredentialPublicCertificate=
>>>> StateSaveLocation=/var/spool/slurm/ctld
>>>> SlurmdSpoolDir=/var/spool/slurm/d
>>>> SwitchType=switch/none
>>>> MpiDefault=none
>>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>>> SlurmdPidFile=/var/run/slurmd.pid
>>>> ProctrackType=proctrack/pgid
>>>> 
>>>> GresTypes=gpu,mps,bandwidth
>>>> 
>>>> PrologFlags=x11
>>>> #PluginDir=
>>>> #FirstJobId=
>>>> #MaxJobCount=
>>>> #PlugStackConfig=
>>>> #PropagatePrioProcess=
>>>> #PropagateResourceLimits=
>>>> #PropagateResourceLimitsExcept=
>>>> #Prolog=
>>>> #Epilog=/etc/slurm/slurm.epilog.clean
>>>> #SrunProlog=
>>>> #SrunEpilog=
>>>> #TaskProlog=
>>>> #TaskEpilog=
>>>> #TaskPlugin=
>>>> #TrackWCKey=no
>>>> #TreeWidth=50
>>>> #TmpFS=
>>>> #UsePAM=
>>>> #
>>>> # TIMERS
>>>> SlurmctldTimeout=300
>>>> SlurmdTimeout=300
>>>> InactiveLimit=0
>>>> MinJobAge=300
>>>> KillWait=30
>>>> Waittime=0
>>>> #
>>>> # SCHEDULING
>>>> SchedulerType=sched/backfill
>>>> #bf_interval=10
>>>> #SchedulerAuth=
>>>> #SelectType=select/linear
>>>> # Cores and memory are consumable
>>>> #SelectType=select/cons_res
>>>> #SelectTypeParameters=CR_Core_Memory
>>>> SchedulerParameters=bf_interval=10
>>>> SelectType=select/cons_res
>>>> SelectTypeParameters=CR_Core
>>>> 
>>>> FastSchedule=1
>>>> #PriorityType=priority/multifactor
>>>> #PriorityDecayHalfLife=14-0
>>>> #PriorityUsageResetPeriod=14-0
>>>> #PriorityWeightFairshare=100000
>>>> #PriorityWeightAge=1000
>>>> #PriorityWeightPartition=10000
>>>> #PriorityWeightJobSize=1000
>>>> #PriorityMaxAge=1-0
>>>> #
>>>> # LOGGING
>>>> SlurmctldDebug=3
>>>> SlurmctldLogFile=/var/log/slurmctld.log
>>>> SlurmdDebug=3
>>>> SlurmdLogFile=/var/log/slurmd.log
>>>> JobCompType=jobcomp/none
>>>> #JobCompLoc=
>>>> #
>>>> # ACCOUNTING
>>>> #JobAcctGatherType=jobacct_gather/linux
>>>> #JobAcctGatherFrequency=30
>>>> #
>>>> #AccountingStorageType=accounting_storage/slurmdbd
>>>> #AccountingStorageHost=
>>>> #AccountingStorageLoc=
>>>> #AccountingStoragePass=
>>>> #AccountingStorageUser=
>>>> #
>>>> #
>>>> #
>>>> # Default values
>>>> # DefMemPerNode=64000
>>>> # DefCpuPerGPU=4
>>>> # DefMemPerCPU=4000
>>>> # DefMemPerGPU=16000
>>>> 
>>>> 
>>>> 
>>>> # OpenHPC default configuration
>>>> #TaskPlugin=task/affinity
>>>> TaskPlugin=task/affinity,task/cgroup
>>>> PropagateResourceLimitsExcept=MEMLOCK
>>>> TaskPluginParam=autobind=cores
>>>> #AccountingStorageType=accounting_storage/mysql
>>>> #StorageLoc=slurm_acct_db
>>>> 
>>>> AccountingStorageType=accounting_storage/slurmdbd
>>>> AccountingStorageHost=cluster-slurmdbd1.example.com
>>>> #AccountingStorageType=accounting_storage/filetxt
>>>> Epilog=/etc/slurm/slurm.epilog.clean
>>>> 
>>>> 
>>>> #PartitionName=normal Nodes=c[1-5] Default=YES MaxTime=24:00:00 State=UP
>>>> PartitionName=DEFAULT State=UP Default=NO AllowGroups=ALL Priority=10 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO  Nodes=nodeamd[009-016],c[1-4],nodehtc[001-025]
>>>> 
>>>> 
>>>> # Partitions
>>>> 
>>>> # Group Limited Queues
>>>> 
>>>> # OIT DEBUG QUEUE
>>>> PartitionName=debug Nodes=c[1-4] MaxTime=24:00:00 State=UP AllowGroups=oit-hpc-admin
>>>> 
>>>> # RNA CHEM
>>>> PartitionName=longq7-rna MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=UNLIMITED Priority=200 Nodes=nodeamd[001-008],nodegpu[021-025] AllowGroups=gpu-rnachem
>>>> 
>>>> # V100's
>>>> PartitionName=longq7-mri MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=200 Nodes=nodenviv100[001-016] AllowGroups=gpu-mri
>>>> 
>>>> # BIGDATA GRANT
>>>> PartitionName=longq-bigdata7 MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=200 Nodes=node[087-098],nodegpu001 AllowGroups=fau-bigdata,nsf-bigdata
>>>> 
>>>> PartitionName=gpu-bigdata7 Default=NO MinNodes=1 Priority=10  AllowAccounts=ALL  Nodes=nodegpu001 AllowGroups=fau-bigdata,nsf-bigdata
>>>> 
>>>> # CogNeuroLab
>>>> PartitionName=CogNeuroLab Default=NO MinNodes=1 MaxNodes=4 MaxTime=7-12:00:00 AllowGroups=cogneurolab Priority=200 State=UP Nodes=node[001-004]
>>>> 
>>>> 
>>>> # Standard queues
>>>> 
>>>> # OPEN TO ALL
>>>> 
>>>> #Short Queue
>>>> PartitionName=shortq7 MinNodes=1 MaxNodes=30 DefaultTime=06:00:00 MaxTime=06:00:00 Priority=100 Nodes=nodeamd[001-016],nodenviv100[001-015],nodegpu[001-025],node[001-100],nodehtc[001-025]  Default=YES
>>>> 
>>>> # Medium Queue
>>>> PartitionName=mediumq7 MinNodes=1 MaxNodes=30 DefaultTime=72:00:00 MaxTime=72:00:00 Priority=50 Nodes=nodeamd[009-016],node[004-100]
>>>> 
>>>> # Long Queue
>>>> PartitionName=longq7 MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=30 Nodes=nodeamd[009-016],node[004-100]
>>>> 
>>>> 
>>>> # Interactive
>>>> PartitionName=interactive MinNodes=1 MaxNodes=4 DefaultTime=06:00:00 MaxTime=06:00:00 Priority=101 Nodes=node[001-100]  Default=No Hidden=YES
>>>> 
>>>> # Nodes
>>>> 
>>>> # Test nodes, (vms)
>>>> NodeName=c[1-4] Cpus=4 Feature=virtual RealMemory=16000
>>>> 
>>>> # AMD Nodes
>>>> NodeName=nodeamd[001-016] Procs=64 Boards=1 SocketsPerBoard=8 CoresPerSocket=8 ThreadsPerCore=1 Features=amd,epyc RealMemory=225436
>>>> 
>>>> # V100 MRI
>>>> NodeName=nodenviv100[001-016] CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:v100:4 Feature=v100 RealMemory=192006
>>>> 
>>>> # GPU nodes
>>>> NodeName=nodegpu001 Procs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=gpu:k80:8 Feature=k80,intel RealMemory=64000
>>>> NodeName=nodegpu002 Procs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=gpu:gk1:8 Feature=gk1,intel RealMemory=128000
>>>> NodeName=nodegpu[003-020] Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 Gres=gpu:gk1:8 Feature=gk1,intel RealMemory=128000
>>>> NodeName=nodegpu[021-025] Procs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 Gres=gpu:4 Feature=exxact,intel RealMemory=128000
>>>> 
>>>> # IvyBridge nodes
>>>> NodeName=node[001-021] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,ivybridge RealMemory=112750
>>>> # SandyBridge node(2)
>>>> NodeName=node022 Procs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 Feature=intel,sandybridge RealMemory=64000
>>>> # IvyBridge
>>>> NodeName=node[023-050] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,ivybridge RealMemory=112750
>>>> # Haswell
>>>> NodeName=node[051-100] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,haswell RealMemory=112750
>>>> 
>>>> 
>>>> # Node health monitoring
>>>> HealthCheckProgram=/usr/sbin/nhc
>>>> HealthCheckInterval=300
>>>> ReturnToService=2
>>>> 
>>>> # Fix for X11 issues
>>>> X11Parameters=use_raw_hostname
>>>> 
>>>> 
>>>> 
>>>> Rhian Resnick
>>>> Associate Director Research Computing
>>>> Enterprise Systems
>>>> Office of Information Technology
>>>> 
>>>> Florida Atlantic University
>>>> 777 Glades Road, CM22, Rm 173B
>>>> Boca Raton, FL 33431
>>>> Phone 561.297.2647
>>>> Fax 561.297.0222
>>>> 
>> 



