[slurm-users] sbatch mem-per-gpu and gres interaction
christof.koehler at bccms.uni-bremen.de
Thu May 25 18:54:12 UTC 2023
Hello everybody,
I am observing an interaction between the --mem-per-gpu, --cpus-per-gpu
and --gres settings in sbatch which I do not understand.
Basically, if the job is submitted with --gres=gpu:2 the --mem-per-gpu
and --cpus-per-gpu settings appear to be observed. If the job is
submitted with --gres=gpu:a100:2 the settings appear to be ignored and
partition defaults are used instead.
First the partition definition from slurm.conf (slurm 23.02.2) and then
a demonstration:
PartitionName=gpu Nodes=gpu[001-004] MaxTime=24:00:00 DefMemPerGPU=124000
DefCpuPerGPU=12 TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=50.0,Gres/gpu:a100=50.0"
State=UP
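For completeness, the GPUs are defined as a typed GRES on the nodes, roughly along these lines (sketch only; exact GPU count, device paths and other node parameters omitted):

# slurm.conf (sketch, assuming 4 GPUs per node)
NodeName=gpu[001-004] Gres=gpu:a100:4
# gres.conf (sketch)
NodeName=gpu[001-004] Name=gpu Type=a100 File=/dev/nvidia[0-3]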
Submitting this jobscript:
#!/bin/bash
#SBATCH --cpus-per-gpu=4
#SBATCH --mem-per-gpu=5000M
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
#SBATCH --time=00:20:00
sleep 60
gives
$ scontrol show job=812|grep -i tres
ReqTRES=cpu=1,mem=10000M,node=1,billing=103,gres/gpu=2
AllocTRES=cpu=8,mem=10000M,node=1,billing=210,gres/gpu=2,gres/gpu:a100=2
CpusPerTres=gres:gpu:4
MemPerTres=gres:gpu:5000
TresPerNode=gres:gpu:2
Changing "--gres=gpu:2" to "--gres=gpu:a100:2" however gives
$ scontrol show job=813|grep -i tres
ReqTRES=cpu=1,mem=500000M,node=1,billing=323,gres/gpu=2,gres/gpu:a100=2
AllocTRES=cpu=24,mem=248000M,node=1,billing=284,gres/gpu=2,gres/gpu:a100=2
CpusPerTres=gpu:12
MemPerTres=gpu:124000
TresPerNode=gres:gpu:a100:2
So, with "--gres=gpu:2" the settings from --mem-per-gpu and --cpus-per-gpu
are used: AllocTRES shows cpu=8 and mem=10000M, i.e. 2 x 4 CPUs and
2 x 5000M. But with "--gres=gpu:a100:2" the partition default values are
used instead: cpu=24 and mem=248000M, i.e. 2 x DefCpuPerGPU=12 and
2 x DefMemPerGPU=124000.
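(The defaults that end up being applied can also be cross-checked against the partition configuration with, e.g.,

$ scontrol show partition gpu | grep -i def

which confirms the defaults are in effect as written above.)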
If I change the partition definition to
PartitionName=gpu Nodes=gpu[001-004] MaxTime=24:00:00 DefMemPerCPU=4096
MaxMemPerCPU=10200 DefCpuPerGPU=12
TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=50.0,Gres/gpu:a100=50.0"
State=UP
we observe the same behaviour: with "--gres=gpu:a100:2" the partition
default values for memory and CPU count are used instead of the values
supplied in the jobscript.
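As a possible workaround (an untested sketch, and assuming the a100 nodes also advertise an "a100" feature in slurm.conf), one could select the GPU type via a node feature constraint while keeping the GRES request untyped, so that --mem-per-gpu and --cpus-per-gpu are honoured as in the first case:

#!/bin/bash
#SBATCH --cpus-per-gpu=4
#SBATCH --mem-per-gpu=5000M
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
# assumes the a100 nodes carry Features=a100 in slurm.conf
#SBATCH --constraint=a100
#SBATCH --time=00:20:00
sleep 60

That would only sidestep the issue, though; the question about the typed-GRES behaviour remains.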
I did not find anything describing such an interaction in the
documentation. Is what we observe the expected behaviour for some
reason? Or is there a problem with our configuration?
Best Regards
Christof
--
Dr. rer. nat. Christof Köhler email: c.koehler at uni-bremen.de
Universitaet Bremen/FB1/BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.06 fax: +49-(0)421-218-62770
28359 Bremen