[slurm-users] [External] Re: exempting a node from Gres Autodetect
Prentice Bisbal
pbisbal at pppl.gov
Tue Feb 23 20:34:59 UTC 2021
I don't see how that bug is related. That bug is about the RPM dependency
on the libnvidia-ml.so library when Slurm is built with NVML autodetect
enabled. His problem is the opposite: he's already using NVML
autodetect, but wants to disable that feature on a single node, and it
looks like that node isn't even using RPMs built with NVML support.
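One quick way to check that (assuming the standard RPM layout; adjust the
paths for your site) is to look for the NVML GPU plugin on that node and
see what it links against:

  ls /usr/lib64/slurm/gpu_nvml.so
  ldd /usr/lib64/slurm/gpu_nvml.so | grep -i nvidia-ml

If the plugin isn't there at all, that slurmd can't do NVML autodetection
no matter what gres.conf says.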
Prentice
On 2/19/21 3:43 PM, Robert Kudyba wrote:
> have you seen this? https://bugs.schedmd.com/show_bug.cgi?id=7919#c7, fixed in 20.06.1
>
> On Fri, Feb 19, 2021 at 11:34 AM Paul Brunk <pbrunk at uga.edu> wrote:
>
> Hi all:
>
> (I hope plague and weather are being visibly less than maximally cruel
> to you all.)
>
> In short, I was trying to exempt a node from NVML Autodetect, and
> apparently introduced a syntax error in gres.conf. This is not an
> urgent matter for us now, but I'm curious what went wrong. Thanks for
> lending any eyes to this!
>
> More info:
>
> Slurm 20.02.6, CentOS 7.
>
> We've historically had only this in our gres.conf:
> AutoDetect=nvml
>
> Each of our GPU nodes has e.g. 'Gres=gpu:V100:1' as part of its
> NodeName entry (GPU models vary across them).
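>
> For illustration, one of those entries looks roughly like this (the
> CPU and memory figures here are just placeholders, not our real
> hardware):
>
> NodeName=a1-10 CPUs=32 RealMemory=192000 Gres=gpu:V100:1 State=UNKNOWN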
>
> I wanted to exempt one GPU node from the autodetect (I was curious
> about the presence or absence of the GPU model subtype designation,
> e.g. 'V100' vs. 'v100s'), so I changed gres.conf to this (modelled
> after the gres.conf man page):
>
> AutoDetect=nvml
> NodeName=a1-10 AutoDetect=off Name=gpu File=/dev/nvidia0
>
> I restarted slurmctld, then ran "scontrol reconfigure". Each node got
> a fatal error parsing gres.conf, which broke RPC between slurmctld and
> the nodes and led slurmctld to consider the nodes failed.
>
> Here's how it looked to slurmctld:
>
> [2021-02-04T13:36:30.482] backfill: Started
> JobId=1469772_3(1473148) in batch on ra3-6
> [2021-02-04T15:14:48.642] error: Node ra3-6 appears to have a
> different slurm.conf than the slurmctld. This could cause issues
> with communication and functionality. Please review both files
> and make sure they are the same. If this is expected ignore, and
> set DebugFlags=NO_CONF_HASH in your slurm.conf.
> [2021-02-04T15:25:40.258] agent/is_node_resp: node:ra3-6
> RPC:REQUEST_PING : Communication connection failure
> [2021-02-04T15:39:49.046] requeue job JobId=1443912 due to failure
> of node ra3-6
>
> And here's how it looked to the slurmds:
>
> [2021-02-04T15:14:50.730] Message aggregation disabled
> [2021-02-04T15:14:50.742] error: Parsing error at unrecognized
> key: AutoDetect
> [2021-02-04T15:14:50.742] error: Parse error in file
> /var/lib/slurmd/conf-cache/gres.conf line 2: " AutoDetect=off
> Name=gpu File=/dev/nvidia0"
> [2021-02-04T15:14:50.742] fatal: error opening/reading
> /var/lib/slurmd/conf-cache/gres.conf
>
> Reverting to the original one-line gres.conf returned the cluster
> to production state.
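>
> In hindsight, a lower-risk way to test a change like this might have
> been to stage the candidate gres.conf on a single node and check it
> there first, e.g. (assuming our slurmd build has the -G option, which
> just prints the parsed GRES configuration and exits):
>
> slurmd -G
>
> That way a parse error would show up on one host instead of
> cluster-wide, though I haven't checked how that interacts with the
> configless conf-cache we're using.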
>
> --
> Paul Brunk, system administrator
> Georgia Advanced Computing Resource Center
> Enterprise IT Svcs, the University of Georgia
>
>
--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov