Follow up: I was able to fix my problem by following the advice in this post https://blog.devops.dev/slurm-complete-guide-a-to-z-concepts-setup-and-trouble-shooting-for-admins-8dc5034ed65b, which says that the GPU GRES can be configured manually (no autodetect) by adding a line like this to gres.conf: 'NodeName=slurmnode Name=gpu File=/dev/nvidia0'
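For anyone who finds this later: adapted to the node names and GPU type from my slurm.conf below, that gres.conf entry would look something like this (the Type value and the device path are my own guesses for this setup; adjust them for your hardware):

# gres.conf - GPU defined manually instead of AutoDetect=nvml
NodeName=server[1-3] Name=gpu Type=rtx4070 File=/dev/nvidia0

As far as I understand, the Type here has to match the type used in the Gres= entry in slurm.conf (gpu:rtx4070:1 in my case), and slurmd / slurmctld need to be restarted for the change to take effect.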
On Wed, Apr 3, 2024 at 4:30 PM Shooktija S N shooktijasn@gmail.com wrote:
Hi,
I am setting up Slurm on our lab's 3-node cluster and have run into a problem while adding GPUs (each node has an NVIDIA RTX 4070 Ti) as a GRES. There is an error at the 'debug' log level in slurmd.log saying that the GPU is file-less and is being removed from the final GRES list. According to some older posts on this forum, this error might be fixed by reinstalling / reconfiguring Slurm with the right flag (the '--with-nvml' flag, according to this post: https://groups.google.com/g/slurm-users/c/cvGb4JnK8BY).
Line in /var/log/slurmd.log: [2024-04-03T15:42:02.695] debug: Removing file-less GPU gpu:rtx4070 from final GRES list
Does this error require me to reinstall or reconfigure Slurm? And what does 'reconfigure Slurm' actually mean? I'm about as clueless as a caveman with a smartphone when it comes to Slurm administration (and Linux system administration in general), so please explain it to me as simply as possible.
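From those older posts, my best guess is that 'reconfigure' means rebuilding Slurm from source with NVML support enabled, something roughly like the following (the CUDA path is just a guess based on a default install, and I have not actually tried this yet):

# rebuild Slurm from source so that AutoDetect=nvml is supported
./configure --with-nvml=/usr/local/cuda
make
sudo make install

Is that the right idea, or is there a simpler way?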
slurm.conf without comment lines:

ClusterName=DlabCluster
SlurmctldHost=server1
GresTypes=gpu
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=debug2
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx4070:1
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
gres.conf (only one line): AutoDetect=nvml
I know that NVML was installed along with CUDA because of this line in /var/log/cuda-installer.log: [INFO]: Installing: cuda-nvml-dev
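In case it helps, a quick way to confirm that the NVML header and the runtime library are actually on the system would be something like this (just how I would check; there may be a better way):

# look for the NVML header (installed by cuda-nvml-dev) and the driver's NVML library
find /usr -name nvml.h 2>/dev/null
ldconfig -p | grep -i nvidia-ml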
Thanks!
PS: I would have added this as a continuation of this post https://groups.google.com/g/slurm-users/c/p68dkeUoMmA, but for some reason I do not have permission to post to that group, so I am starting a new thread here.