Patrick Goetz pgoetz at math.utexas.edu
Thu Aug 24 22:13:32 UTC 2023

Hi Rob -

Thanks for this suggestion. I'm sure I restarted slurmd on the nodes 
multiple times with nothing in the slurm log file on the node, but after

   # tail -f /var/slurm-llnl/slurmd.log
   # systemctl restart slurmd

I started to get errors in the log which eventually led me to the solution.

To save future users the days of frustration I just experienced, here is 
what I discovered.

All the problems were confined to the shared slurm.conf file.  As a 
reminder, all this just worked in Slurm 17.x.

Slurm 19.05 no longer likes this syntax:

   NodeName=titan-[3-15] Gres=gpu:titanv:8 RealMemory=250000 CPUs=40

The log file on the node included this error message:

   error: Node configuration differs from hardware: Procs=40:40(hw) 
Boards=1:1(hw) SocketsPerBoard=40:2(hw) CoresPerSocket=1:10(hw) 
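
For anyone squinting at a message like this, here is a rough sketch that 
prints only the fields whose two numbers disagree.  It assumes the first 
number in each pair is what slurm.conf implies and the one tagged (hw) is 
what slurmd detected; the function name is made up for illustration.

```shell
# Hypothetical helper: read a "differs from hardware" log line on stdin and
# print the fields whose two values are not equal.
# Assumption: each field looks like Name=<configured>:<detected>(hw).
# Possible use, with the log path from this post:
#   grep 'differs from hardware' /var/slurm-llnl/slurmd.log | tail -n 1 | mismatched_fields
mismatched_fields() {
    tr ' ' '\n' |
    sed -n 's/^\([A-Za-z]*\)=\([0-9]*\):\([0-9]*\)(hw)$/\1 \2 \3/p' |
    while read -r name conf hw; do
        if [ "$conf" != "$hw" ]; then
            printf '%s: slurm.conf=%s hardware=%s\n' "$name" "$conf" "$hw"
        fi
    done
}
```

Fed the message above, this should report only SocketsPerBoard and 
CoresPerSocket.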

Notice that the layout Slurm assumes from CPUs=40 alone (40 sockets of 
one core each) doesn't match what it detects on the hardware.  The 
solution was to use precisely what's reported by

   slurmd -C

on the node:

   NodeName=titan-[3-15] Gres=gpu:titanv:8 CPUs=40 Boards=1 
SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=250000
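
To avoid retyping those fields by hand, something like the following could 
turn the first line of slurmd -C output into a NodeName line for a whole 
range of identical nodes.  This is only a sketch: the function name is 
invented, and it assumes the first line of slurmd -C starts with 
NodeName=<hostname>.  The RealMemory that slurmd -C reports may also 
differ from a hand-picked value such as 250000.

```shell
# Hypothetical helper: read `slurmd -C` output on stdin and print a
# slurm.conf NodeName line.
#   $1 = hostlist expression to substitute for the single hostname
#   $2 = optional extra fields to append (e.g. a Gres= setting)
# Possible use on a node:
#   slurmd -C | nodename_line 'titan-[3-15]' 'Gres=gpu:titanv:8'
nodename_line() {
    # Keep only line 1 (the NodeName line) and swap in the hostlist.
    sed -n "1s/^NodeName=[^ ]*/NodeName=$1/p" |
    # Append the extra fields, if any were given.
    sed "s/\$/${2:+ $2}/"
}
```

The output still needs a look before it goes into the shared slurm.conf, 
since it only reflects the one node where slurmd -C was run.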

But that wasn't the only issue.  There was also this:

   WARNING: A line in gres.conf for GRES gpu has 8 more configured than 
expected in slurm.conf. Ignoring extra GRES.

It's calling this a warning, but

   # scontrol show node titan-11 | grep Reason

revealed that this mismatch was causing the node to drain immediately 
after being set to idle.  The problem was this:



For some reason this syntax was acceptable to Slurm 17, but not Slurm 
19.  The fix was

   Gres=gpu:titanv:8  -->  Gres=gpu:8

Final correct NodeName syntax:

   NodeName=titan-[3-15] Gres=gpu:8 CPUs=40 Boards=1 SocketsPerBoard=2 
CoresPerSocket=10 ThreadsPerCore=2 RealMemory=250000

Researching all this raised a number of questions, e.g. do I need to 
express CPU affinity in gres.conf, but the users now have at least the 
functionality they enjoyed previously.

On 8/24/23 11:16, Groner, Rob wrote:
> Ya, I agree about the invalid argument not being much help.
> In times past when I encountered issues like that, I typically tried:
>   * restart slurmd on the compute node.  Watch its log to see what it
>     complains about.  Usually it's about memory.
>   * Set the configuration of the node to whatever slurmd -C says, or set
>     config_overrides in slurm.conf
> Rob
