Hi,
I am setting up Slurm on our lab's 3-node cluster and have run into a problem while adding the GPUs (each node has an NVIDIA RTX 4070 Ti) as a GRES. slurmd.log contains a message at the 'debug' log level saying that the GPU is file-less and is being removed from the final GRES list. According to some older posts on this forum (for example this one: https://groups.google.com/g/slurm-users/c/cvGb4JnK8BY), this might be fixed by reinstalling / reconfiguring Slurm with the right flag ('--with-nvml').
Line in /var/log/slurmd.log: [2024-04-03T15:42:02.695] debug: Removing file-less GPU gpu:rtx4070 from final GRES list
Does this error mean I have to reinstall or reconfigure Slurm? And what does 'reconfigure Slurm' actually mean? I'm about as clueless as a caveman with a smartphone when it comes to Slurm administration and Linux system administration in general, so please explain it to me as simply as possible.
slurm.conf without comment lines:

ClusterName=DlabCluster
SlurmctldHost=server1
GresTypes=gpu
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=debug2
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx4070:1
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
gres.conf (only one line): AutoDetect=nvml
I know that NVML was installed along with CUDA because of this line in /var/log/cuda-installer.log: [INFO]: Installing: cuda-nvml-dev
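For reference, here is roughly how I checked that the GPU device file and the NVML library are actually present on a node (the plugin path is a guess based on Debian's slurm-wlm packaging and may differ on your install):

ls -l /dev/nvidia*    # device files slurmd expects to find for each GPU
ldconfig -p | grep -i nvidia-ml    # NVML runtime library (libnvidia-ml.so) shipped with the driver
ls /usr/lib/x86_64-linux-gnu/slurm-wlm/gpu_nvml.so    # NVML GPU plugin; absent if slurmd was built without NVML support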
Thanks!
PS: I could've added this as a continuation to this post https://groups.google.com/g/slurm-users/c/p68dkeUoMmA, but for some reason I do not have permission to post to that group, so here I am starting a new thread.
The Slurm source code should be downloaded and recompiled with the '--with-nvml' configuration flag.
As an example, this is our current method, using the rpmbuild mechanism to recompile and generate RPMs. Be aware that the compile only enables a given option if it finds that option's prerequisites on the build host (e.g., to recompile with --with-nvml you should build on a functioning GPU host).
========
export VERSION=23.11.5
wget https://download.schedmd.com/slurm/slurm-$VERSION.tar.bz2

rpmbuild --define="_with_nvml --with-nvml=/usr" --define="_with_pam --with-pam=/usr" --define="_with_pmix --with-pmix=/usr" --define="_with_hdf5 --without-hdf5" --define="_with_ofed --without-ofed" --define="_with_http_parser --with-http-parser=/usr/lib64" --define="_with_yaml --with-yaml" --define="_with_jwt --with-jwt" --define="_with_slurmrestd --with-slurmrestd=1" -ta slurm-$VERSION.tar.bz2 > build.log-$VERSION-`date +%F` 2>&1
This is a list of packages we ensure are installed on a given node when running this compile:

- pkgs:
  - bzip2
  - cuda-nvml-devel-12-2
  - dbus-devel
  - freeipmi
  - freeipmi-devel
  - gcc
  - gtk2-devel
  - hwloc-devel
  - libjwt-devel
  - libssh2-devel
  - libyaml-devel
  - lua-devel
  - make
  - mariadb-devel
  - munge-devel
  - munge-libs
  - ncurses-devel
  - numactl-devel
  - openssl-devel
  - pam-devel
  - perl
  - perl-ExtUtils-MakeMaker
  - readline-devel
  - rpm-build
  - rpmdevtools
  - rrdtool-devel
  - http-parser-devel
  - json-c-devel
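If the build succeeds, the packages land under rpmbuild's RPMS directory (by default ~/rpmbuild/RPMS/<arch>/ unless %_topdir is redefined). A rough sketch of what we do with them afterwards, to be adapted to your own node layout:

ls ~/rpmbuild/RPMS/x86_64/slurm*.rpm    # packages produced by the rpmbuild step above
rpm -Uvh ~/rpmbuild/RPMS/x86_64/slurm*.rpm    # upgrade the installed Slurm packages in place
systemctl restart slurmd    # on the compute nodes
systemctl restart slurmctld    # on the controller node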
Thank you for the response; it certainly clears up a few things, and the list of required packages is super helpful (where are these listed in the docs?).
Here are a few follow up questions:
I had installed Slurm (version 22.05) using apt, by running 'apt install slurm-wlm'. Do I need to remove that installation first, with something like 'apt-get autoremove slurm-wlm', before compiling Slurm from source as you've described?
You have given this command as an example:

rpmbuild --define="_with_nvml --with-nvml=/usr" --define="_with_pam --with-pam=/usr" --define="_with_pmix --with-pmix=/usr" --define="_with_hdf5 --without-hdf5" --define="_with_ofed --without-ofed" --define="_with_http_parser --with-http-parser=/usr/lib64" --define="_with_yaml --with-yaml" --define="_with_jwt --with-jwt" --define="_with_slurmrestd --with-slurmrestd=1" -ta slurm-$VERSION.tar.bz2 > build.log-$VERSION-`date +%F` 2>&1
Are the options you've used in this example fairly standard for a 'general' installation of Slurm? Where can I learn more about them, so that I don't miss anything that might be necessary for the specs of my cluster?
Would I have to add the paths to the compiled binaries to the PATH or LD_LIBRARY_PATH environment variables?
My nodes are running an OS based on Debian 12 (Proxmox VE). What is the 'rpmbuild' equivalent for my OS, and would the syntax used in your example command be the same for any build tool?
Thanks!
Follow up: I was able to fix my problem by following the advice in this post https://blog.devops.dev/slurm-complete-guide-a-to-z-concepts-setup-and-trouble-shooting-for-admins-8dc5034ed65b, which says that the GPU GRES can be configured manually (without autodetect) by adding a line like 'NodeName=slurmnode Name=gpu File=/dev/nvidia0' to gres.conf.
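For anyone who finds this later, this is roughly what the manual setup looks like on my nodes. Sketch only: I am assuming a single GPU per node exposed as /dev/nvidia0 and the default /etc/slurm config directory, so adjust paths and device files for your own cluster.

# /etc/slurm/gres.conf on every node (replaces the AutoDetect=nvml line)
NodeName=server[1-3] Name=gpu Type=rtx4070 File=/dev/nvidia0

# slurm.conf keeps the matching node definition, e.g. Gres=gpu:rtx4070:1

# apply the change: restart slurmd on each node, then have the controller re-read its config
systemctl restart slurmd
scontrol reconfigure

After that, 'scontrol show node server1' should list the GPU under Gres, and jobs can request it with --gres=gpu:1.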