[slurm-users] [External] Re: exempting a node from Gres Autodetect
Prentice Bisbal
pbisbal at pppl.gov
Tue Feb 23 20:34:59 UTC 2021
I don't see how that bug is related. That bug is about the RPM dependency
on the libnvidia-ml.so library when Slurm is built with NVML autodetect
enabled. His problem is the opposite: he's already using NVML
autodetect, but wants to disable that feature on a single node, and it
looks like that node isn't even using RPMs built with NVML support.
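One quick way to check that (assuming the standard RPM layout; adjust the
paths for your site) is to look for the NVML GPU plugin on that node and
see what it links against:

  ls /usr/lib64/slurm/gpu_nvml.so
  ldd /usr/lib64/slurm/gpu_nvml.so | grep -i nvidia-ml

If the plugin isn't there at all, that slurmd can't do NVML autodetection
no matter what gres.conf says.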
Prentice
On 2/19/21 3:43 PM, Robert Kudyba wrote:
> have you seen this? https://bugs.schedmd.com/show_bug.cgi?id=7919#c7, fixed in 20.06.1
>
> On Fri, Feb 19, 2021 at 11:34 AM Paul Brunk <pbrunk at uga.edu> wrote:
>
> Hi all:
>
> (I hope plague and weather are being visibly less than maximally cruel
> to you all.)
>
> In short, I was trying to exempt a node from NVML Autodetect, and
> apparently introduced a syntax error in gres.conf. This is not an
> urgent matter for us now, but I'm curious what went wrong. Thanks for
> lending any eyes to this!
>
> More info:
>
> Slurm 20.02.6, CentOS 7.
>
> We've historically had only this in our gres.conf:
> AutoDetect=nvml
>
> Each of our GPU nodes has e.g. 'Gres=gpu:V100:1' as part of its
> NodeName entry (GPU models vary across them).
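>
> For illustration, one of those entries looks roughly like this (the
> CPU and memory figures here are just placeholders, not our real
> hardware):
>
> NodeName=a1-10 CPUs=32 RealMemory=192000 Gres=gpu:V100:1 State=UNKNOWN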
>
> I wanted to exempt one GPU node from the autodetect (I was curious
> about the presence or absence of the GPU model subtype designation,
> e.g. 'V100' vs. 'v100s'), so I changed gres.conf to this (modelled
> after the gres.conf man page):
>
> AutoDetect=nvml
> NodeName=a1-10 AutoDetect=off Name=gpu File=/dev/nvidia0
>
> I restarted slurmctld, then ran "scontrol reconfigure". Each node got
> a fatal error parsing gres.conf, which broke RPC between slurmctld and
> the nodes and led slurmctld to consider the nodes failed.
>
> Here's how it looked to slurmctld:
>
> [2021-02-04T13:36:30.482] backfill: Started
> JobId=1469772_3(1473148) in batch on ra3-6
> [2021-02-04T15:14:48.642] error: Node ra3-6 appears to have a
> different slurm.conf than the slurmctld. This could cause issues
> with communication and functionality. Please review both files
> and make sure they are the same. If this is expected ignore, and
> set DebugFlags=NO_CONF_HASH in your slurm.conf.
> [2021-02-04T15:25:40.258] agent/is_node_resp: node:ra3-6
> RPC:REQUEST_PING : Communication connection failure
> [2021-02-04T15:39:49.046] requeue job JobId=1443912 due to failure
> of node ra3-6
>
> And here's how it looked to the slurmds:
>
> [2021-02-04T15:14:50.730] Message aggregation disabled
> [2021-02-04T15:14:50.742] error: Parsing error at unrecognized
> key: AutoDetect
> [2021-02-04T15:14:50.742] error: Parse error in file
> /var/lib/slurmd/conf-cache/gres.conf line 2: " AutoDetect=off
> Name=gpu File=/dev/nvidia0"
> [2021-02-04T15:14:50.742] fatal: error opening/reading
> /var/lib/slurmd/conf-cache/gres.conf
>
> Reverting to the original one-line gres.conf returned the cluster
> to production state.
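>
> In hindsight, a lower-risk way to test a change like this might have
> been to stage the candidate gres.conf on a single node and check it
> there first, e.g. (assuming our slurmd build has the -G option, which
> just prints the parsed GRES configuration and exits):
>
> slurmd -G
>
> That way a parse error would show up on one host instead of
> cluster-wide, though I haven't checked how that interacts with the
> configless conf-cache we're using.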
>
> --
> Paul Brunk, system administrator
> Georgia Advanced Computing Resource Center
> Enterprise IT Svcs, the University of Georgia
>
>
--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov