[slurm-users] Configless mode enabling issue

David Henkemeyer david.henkemeyer at gmail.com
Sat May 8 00:31:16 UTC 2021


Thank you for the reply, Will!

The slurm.conf file only has one line in it:

AutoDetect=nvml

While debugging, I copied this file from the GPU node to the controller,
but that's when I noticed that the node without a GPU then crashed on startup.
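
If it helps, my understanding is that a bare AutoDetect=nvml line applies to
every node, which would explain why the non-GPU node dies. A rough sketch of a
gres.conf on the controller that avoids autodetection entirely (the /dev/nvidia0
device path is just a guess on my part, not something verified on our node)
would be:

NodeName=devops3 Name=gpu Type=kepler File=/dev/nvidia0

Just noting that as a possible workaround to try.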

David

On Fri, May 7, 2021 at 12:14 PM Will Dennis <wdennis at nec-labs.com> wrote:

> Hi David,
>
> What is in the gres.conf in the controller’s /etc/slurm? Is it autodetect
> via nvml?
>
> In configless mode the slurm.conf, gres.conf, etc. are maintained only on the
> controller, and the worker nodes get them from there automatically (you don’t
> want those files on the worker nodes). If you need to see what the slurmd
> daemon is seeing/doing in real time, start slurmd on the node via
> “slurmd -Dvvvv” and you will see the log messages on stdout. (If it normally
> runs via systemd, then “systemctl stop slurmd” first.)
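>
> For reference, a minimal sketch of that debug session (assuming slurmd runs
> as a systemd unit named slurmd, and that it is started as root) would be:
>
>     systemctl stop slurmd
>     slurmd -Dvvvv
>
> then Ctrl-C when done and start the unit again with “systemctl start slurmd”.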
>
> Regards,
> Will
>
>
> ------------------------------
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> David Henkemeyer <david.henkemeyer at gmail.com>
> Sent: Friday, May 7, 2021 2:41:41 PM
> To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
> Subject: [slurm-users] Configless mode enabling issue
>
> Hello all. My team is enabling Slurm (version 20.11.5) in our environment,
> and we got a controller up and running, along with 2 nodes.  Everything was
> working fine.  However, when we tried to enable configless mode, I ran into a
> problem.  The node that has a GPU is coming up in "drained" state, and
> sinfo -Nl shows the following:
>
> (dhenkemeyer)-(devops1)-(x86_64-redhat-linux-gnu)-(~/slurm/bin)
> (! 726)-> sinfo -Nl
> Fri May 07 10:20:20 2021
> NODELIST   NODES PARTITION       STATE CPUS    S:c:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> devops2        1    debug*        idle 4       1:4:1   9913        0      1 avx,cent none
> devops3        1    debug*     drained 8       2:4:1  40213        0      1  foo,bar gres/gpu count repor
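>
> (The REASON column is truncated by sinfo there; running something like
> “scontrol show node devops3” should print the full Reason= string along with
> the node’s configured Gres= line, in case the complete message is useful.)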
>
> As you can see, it appears to be related to the gres/gpu count.  Here is
> the entry for the node, in the slurm.conf file (which is attached) on the
> controller:
>
> NodeName=devops3 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=40213 Features=foo,bar Gres=gpu:kepler:1
>
> Prior to this, we also tried a simpler way of expressing Gres:
>
> NodeName=devops3 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=40213 Features=foo,bar Gres=gpu:1
>
> But that also failed. I have logging enabled on the controller, and debug
> output enabled when I launch slurmd on the nodes.  On the problematic node
> (the one with a GPU), I am seeing this repeating message:
>
> slurmd: debug:  Unable to register with slurm controller, retrying
>
> and on the controller, I am seeing this repeating message:
>
> [2021-05-07T10:23:30.417] error: _slurm_rpc_node_registration node=devops3: Invalid argument
>
> So they are definitely related.  Any help would be appreciated.  I tried
> moving the slurm.conf file from the GPU node to the controller, but that
> caused our non-GPU node to puke on startup:
>
> slurmd: fatal: We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.
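>
> (For what it's worth, that fatal reads as if the slurmd on the non-GPU node
> was built without NVML support; a quick check, assuming the default RHEL
> plugin directory, would be something like “ls /usr/lib64/slurm/gpu_nvml.so”
> on each node to see whether the NVML GPU plugin was built at all — that path
> is an assumption and yours may differ.)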
>
>

