Thank you for the response; it clears up a few things, and the list of required packages is super helpful (where are these listed in the docs?).

Here are a few follow-up questions:

I had installed Slurm (version 22.05) via apt by running 'apt install slurm-wlm'. Do I need to remove that installation first (e.g. with 'apt-get autoremove slurm-wlm') before compiling the Slurm source code from scratch, as you've described?

You have given this command as an example:
rpmbuild --define="_with_nvml --with-nvml=/usr" --define="_with_pam --with-pam=/usr" --define="_with_pmix --with-pmix=/usr" --define="_with_hdf5 --without-hdf5" --define="_with_ofed --without-ofed" --define="_with_http_parser --with-http-parser=/usr/lib64" --define="_with_yaml --with-yaml" --define="_with_jwt --with-jwt" --define="_with_slurmrestd --with-slurmrestd=1" -ta slurm-$VERSION.tar.bz2 > build.log-$VERSION-`date +%F` 2>&1

Are the options used in this example fairly standard for a 'general' installation of Slurm? Where can I learn more about these options, so that I don't miss any that might be necessary for the specs of my cluster?

Would I have to add the paths to the compiled binaries to the PATH or LD_LIBRARY_PATH environment variables?

My nodes are running an OS based on Debian 12 (Proxmox VE). What is the 'rpmbuild' equivalent for my OS, and would the syntax used in your example command be the same for any build tool?

Thanks!


On Wed, Apr 3, 2024 at 9:18 PM Williams, Jenny Avis <jennyw@email.unc.edu> wrote:

Slurm source code should be downloaded and recompiled including the configure flag --with-nvml.


As an example, this is our current method, using the rpmbuild mechanism to recompile and generate RPMs. Be aware that the compile enables a given option only if it finds that option's prerequisites on the build host (e.g. to recompile with --with-nvml you should do so on a functioning GPU host).


========


export VERSION=23.11.5


wget https://download.schedmd.com/slurm/slurm-$VERSION.tar.bz2


rpmbuild --define="_with_nvml --with-nvml=/usr" --define="_with_pam --with-pam=/usr" --define="_with_pmix --with-pmix=/usr" --define="_with_hdf5 --without-hdf5" --define="_with_ofed --without-ofed" --define="_with_http_parser --with-http-parser=/usr/lib64" --define="_with_yaml --with-yaml" --define="_with_jwt --with-jwt" --define="_with_slurmrestd --with-slurmrestd=1" -ta slurm-$VERSION.tar.bz2 > build.log-$VERSION-`date +%F` 2>&1


This is a list of packages we ensure are installed on a given node when running this compile.


    - pkgs:
      - bzip2
      - cuda-nvml-devel-12-2
      - dbus-devel
      - freeipmi
      - freeipmi-devel
      - gcc
      - gtk2-devel
      - hwloc-devel
      - libjwt-devel
      - libssh2-devel
      - libyaml-devel
      - lua-devel
      - make
      - mariadb-devel
      - munge-devel
      - munge-libs
      - ncurses-devel
      - numactl-devel
      - openssl-devel
      - pam-devel
      - perl
      - perl-ExtUtils-MakeMaker
      - readline-devel
      - rpm-build
      - rpmdevtools
      - rrdtool-devel
      - http-parser-devel
      - json-c-devel
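(Editor's note: since the compile silently drops any option whose prerequisites are missing, it can be worth verifying the NVML files are present before building. A minimal check; the paths below assume a standard CUDA/driver layout and may differ on your system:)

```shell
# Check that the header and library that --with-nvml needs are present
# on this node. /usr/include/nvml.h and /usr/lib64/libnvidia-ml.so are
# assumed locations from a typical CUDA install; adjust for your distro.
for f in /usr/include/nvml.h /usr/lib64/libnvidia-ml.so; do
    if [ -e "$f" ]; then
        echo "found   $f"
    else
        echo "MISSING $f"
    fi
done
```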


From: Shooktija S N via slurm-users <slurm-users@lists.schedmd.com>
Sent: Wednesday, April 3, 2024 7:01 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] How to reinstall / reconfigure Slurm?


Hi,


I am setting up Slurm on our lab's 3-node cluster, and I have run into a problem while adding GPUs (each node has an NVIDIA 4070 Ti) as a GRES. At the 'debug' log level, slurmd.log reports that the GPU is file-less and is being removed from the final GRES list. According to some older posts on this forum, this error might be fixed by reinstalling / reconfiguring Slurm with the right flag (the '--with-nvml' flag, according to this post).


Line in /var/log/slurmd.log:

[2024-04-03T15:42:02.695] debug:  Removing file-less GPU gpu:rtx4070 from final GRES list


Does this error require me to reinstall or reconfigure Slurm? And what does 'reconfigure Slurm' mean?

I'm about as clueless as a caveman with a smartphone when it comes to Slurm administration and Linux system administration in general. So, if you could, please explain it to me as simply as possible.


slurm.conf without comment lines:

ClusterName=DlabCluster
SlurmctldHost=server1
GresTypes=gpu
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=debug2
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx4070:1
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP


gres.conf (only one line):

AutoDetect=nvml
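(Editor's note: for reference, if rebuilding with NVML support turns out not to be possible, the file-less error can also be sidestepped by declaring the GPU's device file explicitly in gres.conf instead of autodetecting it. A sketch only; /dev/nvidia0 is an assumption, so confirm the actual path with ls /dev/nvidia*:)

```
# gres.conf without autodetection; the File= path is assumed, not verified
NodeName=server[1-3] Name=gpu Type=rtx4070 File=/dev/nvidia0
```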


I know that NVML was installed along with CUDA because of this line in /var/log/cuda-installer.log:

[INFO]: Installing: cuda-nvml-dev


Thanks!


PS: I could've added this as a continuation of this post, but for some reason I do not have permission to post to that group, so here I am starting a new thread.