Assuming that you have the CUDA drivers installed correctly (nvidia-smi runs and lists the GPU, for instance), you should create a gres.conf with just this line:
AutoDetect=nvml
If that doesn’t automagically begin working, you can increase the verbosity of slurmd by setting this in slurm.conf:
SlurmdDebug=debug2
It should then print a bunch of logs describing any GPUs that are found. You may need to change the type name from RTX4070TI (which is wordy as is). I’m not sure just how lax Slurm and the NVML interface are when matching type strings.
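If the type string turns out to be the sticking point, one way to sidestep it is to leave the type out of slurm.conf. Separately, if autodetection is unavailable (slurmd has to be built with NVML support for AutoDetect=nvml to work), the device file can be listed by hand in gres.conf. A minimal sketch of both, assuming each node exposes its single GPU as /dev/nvidia0; that path is an assumption, so check it with ls /dev/nvidia* first:

```
# slurm.conf: same node line as in the original message, but with an untyped GPU.
# This works with either gres.conf variant (AutoDetect=nvml or the manual line below).
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:1

# gres.conf: manual alternative to AutoDetect=nvml (use one or the other).
# File=/dev/nvidia0 is assumed, not taken from the original message.
NodeName=server[1-3] Name=gpu File=/dev/nvidia0
```

After changing either file, restart slurmctld and the slurmd daemons so the new GRES definition is picked up.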
Hope that helps,
Reed
On Apr 2, 2024, at 6:08 AM, Shooktija S N via slurm-users slurm-users@lists.schedmd.com wrote:
Hi,
I am trying to set up Slurm (version 22.05) on a 3-node cluster, with each node having an NVIDIA GeForce RTX 4070 Ti GPU. I tried to follow the GRES setup tutorial on the SchedMD website and added Gres=gpu:RTX4070TI:1 to the node configuration in /etc/slurm/slurm.conf:
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:RTX4070TI:1
I do not have a gres.conf. However, I see this line at the debug log level in /var/log/slurmd.log:
[2024-04-02T15:57:19.022] debug: Removing file-less GPU gpu:RTX4070TI from final GRES list
What other configs are necessary for Slurm to work with my GPU?
More information:
OS: Proxmox VE 8.1.4
Kernel: 6.5.13
CPU: AMD EPYC 7662
Memory: 128636MiB
/etc/slurm/slurm.conf, which is shared by all 3 nodes (comment lines removed):
ClusterName=DlabCluster
SlurmctldHost=server1
GresTypes=gpu
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=debug
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:RTX4070TI:1
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP