[slurm-users] Configless mode enabling issue
David Henkemeyer
david.henkemeyer at gmail.com
Fri May 7 18:41:41 UTC 2021
Hello all. My team is enabling Slurm (version 20.11.5) in our environment,
and we got a controller up and running, along with 2 nodes. Everything was
working fine. However, when we tried to enable configless mode, we ran into
a problem. The node that has a GPU is coming up in the "drained" state, and
sinfo -Nl shows the following:
(dhenkemeyer)-(devops1)-(x86_64-redhat-linux-gnu)-(~/slurm/bin)
(! 726)-> sinfo -Nl
Fri May 07 10:20:20 2021
NODELIST  NODES PARTITION STATE    CPUS S:c:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
devops2       1 debug*    idle        4 1:4:1   9913        0      1 avx,cent none
devops3       1 debug*    drained     8 2:4:1  40213        0      1 foo,bar  gres/gpu count repor
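The REASON column above is cut off at 20 characters, so the full drain reason is not visible. A minimal sketch of how it could be read in full, assuming the scontrol and sinfo binaries from the same installation are on the PATH (the grep filter is only for brevity):

```shell
# Show the complete Reason= field for the drained GPU node
scontrol show node devops3 | grep -i reason

# List every node with its state and untruncated reason
# (%N = node name, %T = state, %E = reason; no field width means no truncation)
sinfo -N -o "%N %T %E"
```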
As you can see, it appears to be related to the gres/gpu count. Here is
the entry for the node, in the slurm.conf file (which is attached) on the
controller:
NodeName=devops3 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=40213 Features=foo,bar Gres=gpu:kepler:1
Prior to this, we also tried a simpler way of expressing Gres:
NodeName=devops3 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=40213 Features=foo,bar Gres=gpu:1
But that also failed. I am logging on the controller, and have enabled debug
output when I launch slurmd on the nodes. On the problematic node (the one
with a GPU), I am seeing this repeating message:
slurmd: debug: Unable to register with slurm controller, retrying
and on the controller, I am seeing this repeating message:
[2021-05-07T10:23:30.417] error: _slurm_rpc_node_registration node=devops3: Invalid argument
So they are definitely related. Any help would be appreciated. I tried
moving the slurm.conf file from the GPU node to the controller, but that
caused our non-GPU node to puke on startup:
slurmd: fatal: We were configured to autodetect nvml functionality,
but we weren't able to find that lib when Slurm was configured.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: slurm.conf
Type: application/octet-stream
Size: 3780 bytes
Desc: not available
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20210507/ae9f87d0/attachment.obj>