[slurm-users] Configless mode enabling issue

David Henkemeyer david.henkemeyer at gmail.com
Fri May 7 18:41:41 UTC 2021


Hello all. My team is enabling Slurm (version 20.11.5) in our environment.
We got a controller up and running, along with 2 nodes, and everything was
working fine.  However, when we tried to enable configless mode, we ran into
a problem.  The node that has a GPU comes up in the "drained" state, and
sinfo -Nl shows the following:

(dhenkemeyer)-(devops1)-(x86_64-redhat-linux-gnu)-(~/slurm/bin)
(! 726)-> sinfo -Nl
Fri May 07 10:20:20 2021
NODELIST   NODES PARTITION       STATE CPUS    S:c:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
devops2        1    debug*        idle 4       1:4:1   9913        0      1 avx,cent none
devops3        1    debug*     drained 8       2:4:1  40213        0      1  foo,bar gres/gpu count repor

As you can see, it appears to be related to the gres/gpu count.  Here is
the entry for the node, in the slurm.conf file (which is attached) on the
controller:

NodeName=devops3 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=40213 Features=foo,bar Gres=gpu:kepler:1

Prior to this, we also tried a simpler way of expressing Gres:

NodeName=devops3 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=40213 Features=foo,bar Gres=gpu:1

But that also failed.  I have logging enabled on the controller, and have
enabled debug output when I launch slurmd on the nodes.  On the problematic
node (the one with a GPU), I am seeing this repeating message:

slurmd: debug:  Unable to register with slurm controller, retrying

and on the controller, I am seeing this repeating message:

[2021-05-07T10:23:30.417] error: _slurm_rpc_node_registration node=devops3: Invalid argument

So they are definitely related.  Any help would be appreciated.  I tried
moving the slurm.conf file from the GPU node to the controller, but that
caused our non-GPU node to puke on startup:

slurmd: fatal: We were configured to autodetect nvml functionality,
but we weren't able to find that lib when Slurm was configured.
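
For reference, here is a minimal sketch of how the pieces described above
usually fit together in configless mode. The controller serves slurm.conf and
gres.conf to every slurmd, so one gres.conf has to be valid for both the GPU
node and the non-GPU node. The device path /dev/nvidia0 is an assumption for
illustration only, and the gres.conf line is a sketch, not a copy of the
configuration in use here:

```
# slurm.conf on the controller: turn on configless mode
SlurmctldParameters=enable_configless

# gres.conf on the controller: scope the GPU definition to devops3 only,
# so the non-GPU node (devops2) is not asked to set up a GPU.
# The File= path is assumed; it must match the real device on devops3.
NodeName=devops3 Name=gpu Type=kepler File=/dev/nvidia0
```

Each slurmd would then be started with the --conf-server option pointing at
the controller host, rather than reading a local slurm.conf.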
-------------- next part --------------
A non-text attachment was scrubbed...
Name: slurm.conf
Type: application/octet-stream
Size: 3780 bytes
Desc: not available
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20210507/ae9f87d0/attachment.obj>

