<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>How many nodes are we talking about here? What if you gave each
      node its own gres.conf file, with all of them containing</p>

    <pre class="moz-quote-pre" wrap=""><font size="+1">AutoDetect=nvml</font>

</pre>

    <p>Except the one you want to exclude, which would have this in
      its gres.conf:</p>

    <pre class="moz-quote-pre" wrap=""><font size="+1">NodeName=a1-10 AutoDetect=off Name=gpu File=/dev/nvidia0

</font></pre>

    <p>It seems to me that AutoDetect=nvml and AutoDetect=off are
      mutually exclusive within the same gres.conf file, but maybe my
      suggestion would work. If you have a small number of GPU nodes, or
      use a configuration management tool like Ansible, Chef, or Puppet,
      it might be worth a shot. <br>

    </p>
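    <p>If you would rather script it than set up a full configuration
      management tool, here is a rough, untested sketch of how the
      per-node files could be generated. The function names and the
      one-directory-per-node staging layout are my own invention, not
      anything Slurm provides, and copying each file to its node is left
      to whatever tool you already use. The manual line for a1-10 is
      passed in as-is, so if slurmd on your version still rejects the
      AutoDetect key on that line, you can change the string without
      touching the rest.</p>

    <pre class="moz-quote-pre" wrap="">
```python
from pathlib import Path

# Default content for every node that should keep NVML autodetection.
AUTODETECT_LINE = "AutoDetect=nvml"


def gres_conf_for(node, manual_lines):
    """Return the gres.conf text for one node.

    manual_lines maps a node name to the hand-written line that should
    replace NVML autodetection on that node. Nodes not listed there get
    the plain autodetect line.
    """
    return manual_lines.get(node, AUTODETECT_LINE) + "\n"


def write_gres_confs(nodes, manual_lines, out_dir):
    """Write out_dir/NODE/gres.conf for each node (hypothetical layout).

    Returns a dict mapping node name to the path that was written.
    """
    written = {}
    for node in nodes:
        target = Path(out_dir) / node / "gres.conf"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(gres_conf_for(node, manual_lines))
        written[node] = target
    return written
```
</pre>

    <p>You would call it with something like
      write_gres_confs(["a1-10", "ra3-6"], {"a1-10": "NodeName=a1-10
      AutoDetect=off Name=gpu File=/dev/nvidia0"}, "gres-staging"), where
      the node list and the "gres-staging" directory are only
      placeholders.</p>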

    <p>Prentice<br>

    </p>

    <br>

    <div class="moz-cite-prefix">On 2/19/21 11:31 AM, Paul Brunk wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:BN8PR02MB5954B0053CD32AD507C1D3DBC3849@BN8PR02MB5954.namprd02.prod.outlook.com">

      <pre class="moz-quote-pre" wrap="">Hi all:

(I hope plague and weather are being visibly less than maximally cruel

to you all.)

In short, I was trying to exempt a node from NVML Autodetect, and

apparently introduced a syntax error in gres.conf.  This is not an

urgent matter for us now, but I'm curious what went wrong.  Thanks for

lending any eyes to this!

More info:

Slurm 20.02.6, CentOS 7.

We've historically had only this in our gres.conf:

AutoDetect=nvml

Each of our GPU nodes has e.g. 'Gres=gpu:V100:1' as part of its

NodeName entry (GPU models vary across them).

I wanted to exempt one GPU node from the autodetect (was curious about

the presence or absence of the GPU model subtype designation,

e.g. 'V100' vs. 'v100s'), so I changed gres.conf to this (modelled

after 'gres.conf' man page):

AutoDetect=nvml

NodeName=a1-10 AutoDetect=off Name=gpu File=/dev/nvidia0

I restarted slurmctld, then "scontrol reconfigure".  Each node got a

fatal error parsing gres.conf, causing RPC failure between slurmctld

and nodes, causing slurmctld to consider the nodes failed.

Here's how it looked to slurmctld:

[2021-02-04T13:36:30.482] backfill: Started JobId=1469772_3(1473148) in batch on ra3-6

[2021-02-04T15:14:48.642] error: Node ra3-6 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

[2021-02-04T15:25:40.258] agent/is_node_resp: node:ra3-6 RPC:REQUEST_PING : Communication connection failure

[2021-02-04T15:39:49.046] requeue job JobId=1443912 due to failure of node ra3-6

And to the slurmd's :

[2021-02-04T15:14:50.730] Message aggregation disabled

[2021-02-04T15:14:50.742] error: Parsing error at unrecognized key: AutoDetect

[2021-02-04T15:14:50.742] error: Parse error in file /var/lib/slurmd/conf-cache/gres.conf line 2: " AutoDetect=off Name=gpu File=/dev/nvidia0"

[2021-02-04T15:14:50.742] fatal: error opening/reading /var/lib/slurmd/conf-cache/gres.conf

Reverting to the original, one-line gres.conf reverted the cluster to production state.

</pre>

    </blockquote>

    <pre class="moz-signature" cols="72">-- 

Prentice Bisbal

Lead Software Engineer

Research Computing

Princeton Plasma Physics Laboratory

<a class="moz-txt-link-freetext" href="http://www.pppl.gov">http://www.pppl.gov</a></pre>

  </body>

</html>