[slurm-users] Two gpu types on one node: gres/gpu count reported lower than configured (1 < 5)

Feng Zhang prod.feng at gmail.com
Mon Oct 16 14:53:18 UTC 2023


Try

scontrol update NodeName=heimdall state=DOWN Reason="gpu issue"

and then

scontrol update NodeName=heimdall state=RESUME

to see if that clears it. Probably the SLURM daemon is just having a
hiccup after you made the changes.
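
If that alone does not clear it, it may also be worth restarting slurmd
on the node after the gres.conf / slurm.conf changes and checking what
GRES the daemon actually reports, for example (assuming the systemd
units that come with the Ubuntu packages):

sudo systemctl restart slurmd
slurmd -G                                   # on recent Slurm versions: print the GRES slurmd detects locally
scontrol show node heimdall | grep -i gres  # compare with what slurmctld has configured

If slurmd only lists one GPU there, that would suggest the mismatch is
on the node side rather than in slurm.conf.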

Best,

Feng

On Mon, Oct 16, 2023 at 10:43 AM Gregor Hagelueken
<hagelueken at uni-bonn.de> wrote:
>
> Hi,
>
> We have an Ubuntu server (22.04) with currently 5 GPUs (1 x l40 and 4 x rtx_a5000).
> I am trying to configure Slurm so that a user can select either the l40 or the a5000 GPUs for a particular job (see the example below).
> I have configured my slurm.conf and gres.conf files in the same way as in this old thread:
> https://groups.google.com/g/slurm-users/c/fc-eoHpTNwU
> I have pasted the contents of the two files below.
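>
> Just to illustrate what I am after, a user should be able to pick a
> GPU type roughly like this (myjob.sh is only a placeholder script):
>
> srun --gres=gpu:l40:1 nvidia-smi
> sbatch --gres=gpu:a5000:2 myjob.sh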
>
> Unfortunately, my node always goes into the “drain” state and scontrol shows this error:
> Reason=gres/gpu count reported lower than configured (1 < 5)
>
> Any idea what I am doing wrong?
> Cheers and thanks for your help!
> Gregor
>
> Here are my gres.conf and slurm.conf files.
>
> # gres.conf
> AutoDetect=off
> NodeName=heimdall Name=gpu Type=l40  File=/dev/nvidia0
> NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia1
> NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia2
> NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia3
> NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia4
>
>
> # slurm.conf file generated by configurator.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> SlurmdDebug=debug2
> #
> ClusterName=heimdall
> SlurmctldHost=localhost
> MpiDefault=none
> ProctrackType=proctrack/linuxproc
> ReturnToService=2
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/lib/slurm/slurmd
> SlurmUser=slurm
> StateSaveLocation=/var/lib/slurm/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/none
> #
> # TIMERS
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core
> GresTypes=gpu
> #
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/none
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=info
> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> SlurmdDebug=info
> SlurmdLogFile=/var/log/slurm/slurmd.log
> #
> # COMPUTE NODES
> NodeName=heimdall CPUs=128 Gres=gpu:l40:1,gpu:a5000:4 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=773635 State=UNKNOWN
> PartitionName=heimdall Nodes=ALL Default=YES MaxTime=INFINITE State=UP DefMemPerCPU=8000 DefCpuPerGPU=16
>
>


