[slurm-users] Configless mode enabling issue

Fri May 7 19:12:27 UTC 2021

Hi David,

What does the gres.conf in the controller’s /etc/slurm look like? Is it set to autodetect via NVML?

In configless mode, slurm.conf, gres.conf, etc. are maintained only on the controller, and the worker nodes fetch them from there automatically (you don’t want those files on the worker nodes). If you need to see what the slurmd daemon is seeing/doing in real time, start slurmd on the node via “slurmd -Dvvvv” and you will see the log messages in the terminal (in foreground mode slurmd copies them to stderr). (If it normally runs via systemd, run “systemctl stop slurmd” first.)
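
A minimal sketch of that check-and-debug sequence, assuming the node's configuration directory is /etc/slurm (the path mentioned for the controller). The helper name find_stale_slurm_conf is hypothetical, not a Slurm tool; it only lists leftover local config files so they can be moved out of the way before restarting slurmd.

```shell
# Hypothetical helper: list Slurm config files left behind in a directory.
# In configless mode these should normally not exist on a worker node.
find_stale_slurm_conf() {
    dir="${1:-/etc/slurm}"   # assumed default location
    for f in slurm.conf gres.conf; do
        if [ -e "$dir/$f" ]; then
            printf '%s\n' "$dir/$f"
        fi
    done
}

# Typical use on the worker node (as root):
#   find_stale_slurm_conf /etc/slurm
#   systemctl stop slurmd
#   slurmd -Dvvvv
```

If the helper prints nothing, the node has no local copies and slurmd should be using whatever the controller serves.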

Regards,
Will

________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of David Henkemeyer <david.henkemeyer at gmail.com>
Sent: Friday, May 7, 2021 2:41:41 PM
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: [slurm-users] Configless mode enabling issue

Hello all. My team is enabling Slurm (version 20.11.5) in our environment, and we got a controller up and running, along with 2 nodes. Everything was working fine. However, when we tried to enable configless mode, we ran into a problem. The node that has a GPU is coming up in the "drained" state, and sinfo -Nl shows the following:

(dhenkemeyer)-(devops1)-(x86_64-redhat-linux-gnu)-(~/slurm/bin)
(! 726)-> sinfo -Nl
Fri May 07 10:20:20 2021
NODELIST   NODES PARTITION       STATE CPUS    S:c:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
devops2        1    debug*        idle 4       1:4:1   9913        0      1 avx,cent none
devops3        1    debug*     drained 8       2:4:1  40213        0      1  foo,bar gres/gpu count repor

As you can see, it appears to be related to the gres/gpu count.  Here is the entry for the node, in the slurm.conf file (which is attached) on the controller:

NodeName=devops3 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=40213 Features=foo,bar Gres=gpu:kepler:1

Prior to this, we also tried a simpler way of expressing Gres:

NodeName=devops3 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=40213 Features=foo,bar Gres=gpu:1

But that also failed. I am logging on the controller, and have enabled debug output when I launch slurmd on the nodes. On the problematic node (the one with a GPU), I am seeing this repeating message:

slurmd: debug:  Unable to register with slurm controller, retrying

and on the controller, I am seeing this repeating message:

[2021-05-07T10:23:30.417] error: _slurm_rpc_node_registration node=devops3: Invalid argument

So they are definitely related.  Any help would be appreciated.  I tried moving the slurm.conf file from the GPU node to the controller, but that caused our non-GPU node to puke on startup:

slurmd: fatal: We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.
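
The fatal message suggests that the configuration now being served includes AutoDetect=nvml, which every slurmd tries to honor, including one built without NVML support. One possible way around this, sketched here under assumptions, is to describe the GPU explicitly and scope the gres.conf line to the GPU node only. The device path /dev/nvidia0 is a guess, and the attached slurm.conf is not reproduced here, so the first two slurm.conf lines show what configless mode and GPU GRES normally require rather than the poster's actual file.

```
# slurm.conf (controller only), relevant lines
SlurmctldParameters=enable_configless
GresTypes=gpu
NodeName=devops3 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=40213 Features=foo,bar Gres=gpu:kepler:1

# gres.conf (controller only), with no global AutoDetect=nvml line
# /dev/nvidia0 is an assumed device path; check the actual node.
NodeName=devops3 Name=gpu Type=kepler File=/dev/nvidia0
```

With a layout like this, the GPU count slurmd reports for devops3 should match the Gres=gpu:kepler:1 entry, and devops2 never needs the NVML library.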