[slurm-users] exempting a node from Gres Autodetect
Robert Kudyba
rkudyba at fordham.edu
Fri Feb 19 20:43:59 UTC 2021
Have you seen this? https://bugs.schedmd.com/show_bug.cgi?id=7919#c7; fixed
in 20.06.1.
On Fri, Feb 19, 2021 at 11:34 AM Paul Brunk <pbrunk at uga.edu> wrote:
> Hi all:
>
> (I hope plague and weather are being visibly less than maximally cruel
> to you all.)
>
> In short, I was trying to exempt a node from NVML Autodetect, and
> apparently introduced a syntax error in gres.conf. This is not an
> urgent matter for us now, but I'm curious what went wrong. Thanks for
> lending any eyes to this!
>
> More info:
>
> Slurm 20.02.6, CentOS 7.
>
> We've historically had only this in our gres.conf:
> AutoDetect=nvml
>
> Each of our GPU nodes has e.g. 'Gres=gpu:V100:1' as part of its
> NodeName entry (GPU models vary across them).
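>
> For concreteness, such a node's slurm.conf entry looks roughly like the
> line below (the CPU and memory values here are invented for illustration):
>
> NodeName=a1-10 Gres=gpu:V100:1 CPUs=32 RealMemory=191000 State=UNKNOWN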
>
> I wanted to exempt one GPU node from the autodetect (I was curious
> whether the GPU model subtype designation, e.g. 'V100' vs. 'v100s',
> would show up), so I changed gres.conf to this (modelled on the
> gres.conf man page):
>
> AutoDetect=nvml
> NodeName=a1-10 AutoDetect=off Name=gpu File=/dev/nvidia0
>
> I restarted slurmctld, then ran "scontrol reconfigure". Each node hit a
> fatal error parsing gres.conf, which broke RPC between slurmctld and the
> nodes, so slurmctld considered the nodes failed.
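>
> For the record, the restart/reconfigure step was the usual pair of
> commands (assuming the stock systemd unit name):
>
> systemctl restart slurmctld
> scontrol reconfigure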
>
> Here's how it looked to slurmctld:
>
> [2021-02-04T13:36:30.482] backfill: Started JobId=1469772_3(1473148) in
> batch on ra3-6
> [2021-02-04T15:14:48.642] error: Node ra3-6 appears to have a different
> slurm.conf than the slurmctld. This could cause issues with communication
> and functionality. Please review both files and make sure they are the
> same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
> slurm.conf.
> [2021-02-04T15:25:40.258] agent/is_node_resp: node:ra3-6 RPC:REQUEST_PING
> : Communication connection failure
> [2021-02-04T15:39:49.046] requeue job JobId=1443912 due to failure of node
> ra3-6
>
> And how it looked to the slurmds:
>
> [2021-02-04T15:14:50.730] Message aggregation disabled
> [2021-02-04T15:14:50.742] error: Parsing error at unrecognized key:
> AutoDetect
> [2021-02-04T15:14:50.742] error: Parse error in file
> /var/lib/slurmd/conf-cache/gres.conf line 2: " AutoDetect=off Name=gpu
> File=/dev/nvidia0"
> [2021-02-04T15:14:50.742] fatal: error opening/reading
> /var/lib/slurmd/conf-cache/gres.conf
>
> Reverting to the original, one-line gres.conf reverted the cluster to
> production state.
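>
> (For anyone curious about the same subtype question: once a node parses
> its gres.conf again, comparing what it reports before and after a change
> like this should show it, e.g.
>
> scontrol show node a1-10 | grep -i Gres
>
> which shows the Gres= string recorded for that node.)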
>
> --
> Paul Brunk, system administrator
> Georgia Advanced Computing Resource Center
> Enterprise IT Svcs, the University of Georgia
>
>
>