Hi,
I am trying to set up Slurm (version 22.05) on a 3-node cluster, with each node having one NVIDIA GeForce RTX 4070 Ti GPU. I followed the GRES setup tutorial on the SchedMD website and added Gres=gpu:RTX4070TI:1 to the node configuration in /etc/slurm/slurm.conf:
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:RTX4070TI:1
I do not have a gres.conf. However, I see this line at the debug log level in /var/log/slurmd.log:
[2024-04-02T15:57:19.022] debug: Removing file-less GPU gpu:RTX4070TI from final GRES list
What other configs are necessary for Slurm to work with my GPU?
More information:
OS: Proxmox VE 8.1.4
Kernel: 6.5.13
CPU: AMD EPYC 7662
Memory: 128636 MiB
/etc/slurm/slurm.conf (shared by all 3 nodes), with the comment lines removed:
ClusterName=DlabCluster
SlurmctldHost=server1
GresTypes=gpu
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=debug
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:RTX4070TI:1
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
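
For reference, the "file-less GPU" wording in the log suggests that slurmd has no device file associated with the gpu:RTX4070TI entry, which is the kind of information a gres.conf normally supplies. A minimal sketch of what such a file might look like is below. It is untested on this cluster, and the device path /dev/nvidia0 is an assumption: it presumes the NVIDIA driver is installed on each Proxmox host and exposes the single GPU under that name.

```
# /etc/slurm/gres.conf (hypothetical sketch, same file on all 3 nodes)
# Name and Type mirror the Gres=gpu:RTX4070TI:1 entry in slurm.conf.
# File=/dev/nvidia0 is an assumed device path; check with: ls /dev/nvidia*
NodeName=server[1-3] Name=gpu Type=RTX4070TI File=/dev/nvidia0
```

If a file like this is added, slurmd on each node and slurmctld would presumably need a restart before the GPU shows up in the output of scontrol show node.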