Hi,
I am trying to set up Slurm (version 22.05) on a 3-node cluster in which each node has an NVIDIA GeForce RTX 4070 Ti GPU.
I followed the GRES setup tutorial on the SchedMD website and added Gres=gpu:RTX4070TI:1 to the node definition in /etc/slurm/slurm.conf:
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:RTX4070TI:1
I do not have a gres.conf file.
With SlurmdDebug=debug set, I see this line in /var/log/slurmd.log:
[2024-04-02T15:57:19.022] debug: Removing file-less GPU gpu:RTX4070TI from final GRES list
What other configuration is necessary for Slurm to work with my GPUs?
More information:
OS: Proxmox VE 8.1.4
Kernel: 6.5.13
CPU: AMD EPYC 7662
Memory: 128636MiB
/etc/slurm/slurm.conf, shared by all 3 nodes (comment lines removed):
ClusterName=DlabCluster
SlurmctldHost=server1
GresTypes=gpu
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=debug
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:RTX4070TI:1
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
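For reference, my untested guess at the missing piece is a gres.conf that ties the GPU to a device file, since the log message complains about a "file-less" GPU. Below is a minimal sketch. It assumes the GPU appears as /dev/nvidia0 on every node (I have not verified this path) and that the file sits next to slurm.conf as /etc/slurm/gres.conf:

```
# /etc/slurm/gres.conf (sketch only; File=/dev/nvidia0 is an assumed device path)
NodeName=server[1-3] Name=gpu Type=RTX4070TI File=/dev/nvidia0
```

The Type value is chosen to match the RTX4070TI type string in the Gres= entry of slurm.conf. If slurmd was built with NVML support, AutoDetect=nvml in gres.conf may be an alternative to listing File= by hand, but I have not confirmed that for this install.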