[slurm-users] Two gpu types on one node: gres/gpu count reported lower than configured (1 < 5)

Gregor Hagelueken hagelueken at uni-bonn.de
Wed Oct 18 04:35:45 UTC 2023


Dear Feng,
That worked! Thank you!
Cheers 
Gregor

Sent from my iPhone.

> On 16.10.2023 at 17:05, Feng Zhang <prod.feng at gmail.com> wrote:
> 
> Try
> 
> scontrol update NodeName=heimdall state=DOWN Reason="gpu issue"
> 
> and then
> 
> scontrol update NodeName=heimdall state=RESUME
> 
> to see if it will work. It is probably just the SLURM daemon having a
> hiccup after you made the changes.
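> 
> If the node still comes back with the wrong count, it can help to compare
> what the controller thinks the node has with what the node actually
> exposes, for example (node name as in your config, standard NVIDIA
> tooling assumed):
> 
> scontrol show node heimdall | grep -i gres
> nvidia-smi -L
> ls -l /dev/nvidia[0-4]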
> 
> Best,
> 
> Feng
> 
>> On Mon, Oct 16, 2023 at 10:43 AM Gregor Hagelueken
>> <hagelueken at uni-bonn.de> wrote:
>> 
>> Hi,
>> 
>> We have an Ubuntu server (22.04) that currently has 5 GPUs (1 x l40 and 4 x rtx_a5000).
>> I am trying to configure Slurm so that a user can select either the l40 or the a5000 GPUs for a particular job.
>> I have configured my slurm.conf and gres.conf files similarly to this old thread:
>> https://groups.google.com/g/slurm-users/c/fc-eoHpTNwU
>> I have pasted the contents of the two files below.
>> 
>> Unfortunately, my node always goes into “drain” and scontrol shows this error:
>> Reason=gres/gpu count reported lower than configured (1 < 5)
>> 
>> Any idea what I am doing wrong?
>> Cheers and thanks for your help!
>> Gregor
>> 
>> Here are my gres.conf and slurm.conf files.
>> 
>> # gres.conf
>> AutoDetect=off
>> NodeName=heimdall Name=gpu Type=l40  File=/dev/nvidia0
>> NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia1
>> NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia2
>> NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia3
>> NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia4
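>> 
>> The same mapping can also be written more compactly with a device range
>> (assuming /dev/nvidia1-4 really are the four a5000 cards):
>> 
>> AutoDetect=off
>> NodeName=heimdall Name=gpu Type=l40  File=/dev/nvidia0
>> NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia[1-4]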
>> 
>> 
>> # slurm.conf file generated by configurator.html.
>> # Put this file on all nodes of your cluster.
>> # See the slurm.conf man page for more information.
>> #
>> SlurmdDebug=debug2
>> #
>> ClusterName=heimdall
>> SlurmctldHost=localhost
>> MpiDefault=none
>> ProctrackType=proctrack/linuxproc
>> ReturnToService=2
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmctldPort=6817
>> SlurmdPidFile=/var/run/slurmd.pid
>> SlurmdPort=6818
>> SlurmdSpoolDir=/var/lib/slurm/slurmd
>> SlurmUser=slurm
>> StateSaveLocation=/var/lib/slurm/slurmctld
>> SwitchType=switch/none
>> TaskPlugin=task/none
>> #
>> # TIMERS
>> InactiveLimit=0
>> KillWait=30
>> MinJobAge=300
>> SlurmctldTimeout=120
>> SlurmdTimeout=300
>> Waittime=0
>> # SCHEDULING
>> SchedulerType=sched/backfill
>> SelectType=select/cons_tres
>> SelectTypeParameters=CR_Core
>> GresTypes=gpu
>> #
>> #AccountingStoragePort=
>> AccountingStorageType=accounting_storage/none
>> JobCompType=jobcomp/none
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/none
>> SlurmctldDebug=info
>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>> SlurmdDebug=info
>> SlurmdLogFile=/var/log/slurm/slurmd.log
>> #
>> # COMPUTE NODES
>> NodeName=heimdall CPUs=128 Gres=gpu:l40:1,gpu:a5000:4 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=773635 State=UNKNOWN
>> PartitionName=heimdall Nodes=ALL Default=YES MaxTime=INFINITE State=UP DefMemPerCPU=8000 DefCpuPerGPU=16
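>> 
>> With this in place, the idea is that a job requests a specific GPU type
>> via the usual GRES options, e.g. (sketch only, the script name is a
>> placeholder):
>> 
>> sbatch --gres=gpu:l40:1 job.sh
>> sbatch --gres=gpu:a5000:2 job.sh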
>> 
>> 
> 



