[slurm-users] Two gpu types on one node: gres/gpu count reported lower than configured (1 < 5)

Gregor Hagelueken hagelueken at uni-bonn.de
Mon Oct 16 14:39:20 UTC 2023


Hi,

We have an Ubuntu server (22.04) which currently has 5 GPUs (1 x l40 and 4 x rtx_a5000).
I am trying to configure Slurm so that a user can select either the l40 or the a5000 GPUs for a particular job (see the example just below).
I have configured my slurm.conf and gres.conf files similarly to this old thread:
https://groups.google.com/g/slurm-users/c/fc-eoHpTNwU
I have pasted the contents of the two files below.
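
For context, this is roughly how I would like users to request a specific GPU type once the node is healthy (the job script names are just placeholders for illustration):

sbatch --gres=gpu:l40:1 my_l40_job.sh
sbatch --gres=gpu:a5000:2 my_a5000_job.sh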
 
Unfortunately, my node is always in the “drain” state and scontrol shows this error:
Reason=gres/gpu count reported lower than configured (1 < 5)

Any idea what I am doing wrong?
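
In case it helps to narrow this down, these are the checks I intend to run on the node (assuming these are the right commands for comparing what slurmd detects with what is configured):

nvidia-smi -L                                    # list the GPUs the driver sees
slurmd -G                                        # print the GRES configuration slurmd detects, then exit
scontrol show node heimdall                      # compare configured vs. reported GRES and the drain reason
scontrol update NodeName=heimdall State=RESUME   # undrain the node after fixing the config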
Cheers and thanks for your help!
Gregor

Here are my gres.conf and slurm.conf files:
> AutoDetect=off
> NodeName=heimdall Name=gpu Type=l40  File=/dev/nvidia0
> NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia1
> NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia2
> NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia3
> NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia4

> # slurm.conf file generated by configurator.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> SlurmdDebug=debug2
> #
> ClusterName=heimdall
> SlurmctldHost=localhost
> MpiDefault=none
> ProctrackType=proctrack/linuxproc
> ReturnToService=2
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/lib/slurm/slurmd
> SlurmUser=slurm
> StateSaveLocation=/var/lib/slurm/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/none
> #
> # TIMERS
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core
> GresTypes=gpu
> #
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/none
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=info
> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> SlurmdDebug=info
> SlurmdLogFile=/var/log/slurm/slurmd.log
> #
> # COMPUTE NODES
> NodeName=heimdall CPUs=128 Gres=gpu:l40:1,gpu:a5000:4 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=773635 State=UNKNOWN
> PartitionName=heimdall Nodes=ALL Default=YES MaxTime=INFINITE State=UP DefMemPerCPU=8000 DefCpuPerGPU=16

