The Slurm source code should be downloaded and recompiled with the configuration flag --with-nvml.
As an example, this is our current method, which uses rpmbuild to recompile and generate the RPMs. Be aware that the build only works if it finds the prerequisites for each requested option on the build host (e.g. to build with --with-nvml you should do so on a functioning GPU host).
========
export VERSION=23.11.5
wget https://download.schedmd.com/slurm/slurm-$VERSION.tar.bz2
rpmbuild --define="_with_nvml --with-nvml=/usr" \
  --define="_with_pam --with-pam=/usr" \
  --define="_with_pmix --with-pmix=/usr" \
  --define="_with_hdf5 --without-hdf5" \
  --define="_with_ofed --without-ofed" \
  --define="_with_http_parser --with-http-parser=/usr/lib64" \
  --define="_with_yaml --define="_with_jwt --define="_with_slurmrestd --with-slurmrestd=1" \
  -ta slurm-$VERSION.tar.bz2 > build.log-$VERSION-$(date +%F) 2>&1
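One way to confirm that the build actually picked up NVML is to look for the NVML GPU plugin, gpu_nvml.so, in the file lists of the generated packages. The following is only a minimal sketch and not part of our procedure: the helper name has_nvml_plugin is made up for illustration, and it simply filters a file list given on standard input.

```shell
# Hypothetical helper (name chosen for illustration only).
# Reads file paths on stdin, one per line, and succeeds if any of them
# is the NVML GPU plugin, gpu_nvml.so.
has_nvml_plugin() {
    grep -q '/gpu_nvml\.so$'
}
```

Assuming rpmbuild wrote to its default ~/rpmbuild tree on an x86_64 host, it could be used as: rpm -qlp ~/rpmbuild/RPMS/x86_64/slurm-*.rpm | has_nvml_plugin && echo "NVML plugin built". If nothing is printed, the build log is the place to look for why NVML was skipped.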
This is a list of packages we ensure are installed on a given node when running this compile.
- pkgs:
  - bzip2
  - cuda-nvml-devel-12-2
  - dbus-devel
  - freeipmi
  - freeipmi-devel
  - gcc
  - gtk2-devel
  - hwloc-devel
  - libjwt-devel
  - libssh2-devel
  - libyaml-devel
  - lua-devel
  - make
  - mariadb-devel
  - munge-devel
  - munge-libs
  - ncurses-devel
  - numactl-devel
  - openssl-devel
  - pam-devel
  - perl
  - perl-ExtUtils-MakeMaker
  - readline-devel
  - rpm-build
  - rpmdevtools
  - rrdtool-devel
  - http-parser-devel
  - json-c-devel
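Before starting the build, the list above can be checked against what is actually installed on the build host. This is only a sketch under assumptions: missing_packages and the PKG_QUERY variable are hypothetical names introduced here (the override exists so the function can be tried without rpm), and by default the function calls rpm -q.

```shell
# Hypothetical helper: print every package name that the query command
# reports as not installed. PKG_QUERY is an illustrative override;
# when it is unset, "rpm -q" is used.
missing_packages() {
    for pkg in "$@"; do
        # Left unquoted on purpose so "rpm -q" splits into command + option.
        ${PKG_QUERY:-rpm -q} "$pkg" >/dev/null 2>&1 || printf '%s\n' "$pkg"
    done
}
```

For example, missing_packages bzip2 gcc make munge-devel cuda-nvml-devel-12-2 prints nothing when all five packages are present, and otherwise prints the names that still need to be installed.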
From: Shooktija S N via slurm-users <slurm-users@lists.schedmd.com>
Sent: Wednesday, April 3, 2024 7:01 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] How to reinstall / reconfigure Slurm?
Hi,
I am setting up Slurm on our lab's 3-node cluster and have run into a problem while adding GPUs (each node has an NVIDIA 4070 Ti) as a GRES. There is a message at the 'debug' log level in slurmd.log that says the GPU is file-less and is being removed from the final GRES list. According to some older posts on this forum, this might be fixed by reinstalling / reconfiguring Slurm with the right flag (the '--with-nvml' flag, according to this post: https://groups.google.com/g/slurm-users/c/cvGb4JnK8BY).
Line in /var/log/slurmd.log:
[2024-04-03T15:42:02.695] debug: Removing file-less GPU gpu:rtx4070 from final GRES list
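To see which GRES entries are affected on each node, the log can be filtered for this message. The following is a small sketch; the function name fileless_gpus is hypothetical, and the pattern is taken only from the log line quoted above.

```shell
# Hypothetical helper: print the unique GRES names that slurmd reported
# as "file-less". Reads the named log files, or stdin when none are given.
fileless_gpus() {
    sed -n 's/.*Removing file-less GPU \([^ ]*\) from final GRES list.*/\1/p' "$@" | sort -u
}
```

Running fileless_gpus /var/log/slurmd.log on a node lists the entries that were dropped. Running slurmd -G on the same node prints the GRES configuration that slurmd ends up with, which is a useful cross-check.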
Does this error require me to reinstall or reconfigure Slurm? What does 'reconfigure Slurm' mean? I'm about as clueless as a caveman with a smartphone when it comes to Slurm administration and Linux system administration in general. So, if you could, please explain it to me as simply as possible.
slurm.conf without comment lines:
ClusterName=DlabCluster
SlurmctldHost=server1
GresTypes=gpu
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=debug2
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx4070:1
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP

gres.conf (only one line):
AutoDetect=nvml
I know that NVML was installed along with CUDA because of this line in /var/log/cuda-installer.log:
[INFO]: Installing: cuda-nvml-dev
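The installer log shows that the package was installed, but what a Slurm build with NVML support looks for is the nvml.h header and the libnvidia-ml library. A rough way to see where (or whether) they exist on a node is sketched below. The function name find_nvml is hypothetical, and the directories to search are left to the caller because they vary between installations.

```shell
# Hypothetical helper: search the given directory trees for the NVML
# header and shared library, printing every match. Directories that do
# not exist are skipped silently.
find_nvml() {
    for root in "$@"; do
        [ -d "$root" ] || continue
        find "$root" \( -name 'nvml.h' -o -name 'libnvidia-ml.so*' \) 2>/dev/null
    done
}
```

For example, find_nvml /usr /usr/local could be tried first (both paths are guesses, not taken from this thread). An empty result for either file suggests that a build on that node would not find NVML.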
Thanks!
PS: I could've added this as a continuation to this post (https://groups.google.com/g/slurm-users/c/p68dkeUoMmA), but for some reason I do not have permission to post to that group, so I am starting a new thread here.