Hi folks

I'm in the process of standing up some Stampede2 Dell C6320p KNL nodes and I seem to be hitting a blind spot.

I previously "successfully" configured KNLs on an Intel board (S7200AP) with OpenHPC and Rocky 8. I say "successfully" because it works, but my latest troubleshooting suggests that I may have been lucky rather than an expert KNL integrator :-)

I thought I knew what I was doing, but repeating my Intel KNL recipe on the Dell system has unearthed my ignorance of this wonderful (but deprecated) technology. Anecdotally, the KNLs offer excellent performance and power efficiency for our workloads, particularly when contrasted with our alternative available hardware.

The first discovery was that the "syscfg" for Intel boards is not the same as the "syscfg" for Dell boards. I've since sorted this out.

The second discovery was made while troubleshooting an issue that I'm hitting. The slurmd client nodes don't seem to be reading the "knl_generic.conf" parameters that are specified in /etc/slurm on the smshost (OpenHPC parlance for head node; this is a Slurm configless setup). Having realised that, I think my original Intel solution was working by luck more than ingenuity.

For reference, the Slurm configuration for KNL now includes:

```
NodeFeaturesPlugins=knl_generic 
DebugFlags=NodeFeatures 
GresTypes=hbm
```

And I've created a separate "knl_generic.conf" that points to the Dell-specific tools and features.

On the Dell system, slurmd seems to ignore my knl_generic.conf file and is drawing defaults from somewhere else. Slurm still considers SystemType to be Intel, SyscfgPath to be the default location, and SyscfgTimeout to be 1000. For Dell systems, Slurm needs to have SystemType=Dell and Timeout to be 1000.
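
To make the mismatch concrete, here is a minimal Python sketch that compares what a knl_generic.conf asks for against the values slurmd reports. It is only an illustration: the file contents and the SyscfgPath value are hypothetical placeholders, and the "effective" values are assumed to be typed in by hand from whatever the NodeFeatures debug output shows (SystemType=Intel and SyscfgTimeout=1000, as described above).

```python
def parse_conf(text):
    """Parse Slurm-style 'Key=Value' lines into a dict with lowercased keys."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        for token in line.split():  # several Key=Value pairs may share a line
            if "=" in token:
                key, value = token.split("=", 1)
                settings[key.strip().lower()] = value.strip()
    return settings


def ignored_settings(wanted, effective):
    """Return {key: (wanted, effective)} for keys whose running value differs."""
    diffs = {}
    for key, value in wanted.items():
        seen = effective.get(key)
        if seen is None or seen.lower() != value.lower():
            diffs[key] = (value, seen)
    return diffs


# Hypothetical contents of knl_generic.conf (placeholder values only).
conf_text = """
# Dell-specific settings
SystemType=Dell
SyscfgPath=/path/to/dell/syscfg
"""

# Values slurmd appears to be using, copied by hand from its debug output.
effective_text = "SystemType=Intel SyscfgTimeout=1000"

diffs = ignored_settings(parse_conf(conf_text), parse_conf(effective_text))
for key, (want, got) in sorted(diffs.items()):
    print(f"{key}: file says {want!r}, slurmd reports {got!r}")
```

Any key listed by this comparison is one where the running daemon disagrees with the file, which is the symptom described here. A key reported as None simply means it was not included in the hand-copied effective values.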

I don't understand why the nodes are not reading the knl_generic file - any help or clues would be appreciated.

Here's my theory on what is happening:

The Intel KNL system was successful by luck. It probably exhibited the same ignore-the-config-file behaviour, but ran with default knl_generic settings that are stored somewhere as default parameters. I must have lucked out on my Intel KNL system because those defaults happen to be compatible with Intel.

If this assumption is correct, the Dell system is not working because it isn't compatible with the Intel defaults. 

Any clues on how to successfully invoke the config file (or better debugging techniques to figure out why it isn't being read) would be appreciated.

I can share journalctl output if necessary. So far I've tried changing ownership of the config files to root:slurm, copying knl_generic.conf to the compute nodes' /etc/slurm/, and specifying the config file by running "slurmd" with "-f" on the compute nodes. No joy: whenever slurmd runs successfully (when I don't screw up some random experimental settings), it always seems to ignore knl_generic.conf and load default settings from somewhere.

A few questions:

1. Are there default settings stored somewhere? I might be barking up the wrong tree, although I've looked for files that may clash with the config file I've created but can't find any.

2. Is there a better way to force the knl_generic file to be loaded?

3. Is the configless Slurm somehow not passing the knl_generic file on to the clients? My understanding is that, in configless mode, the clients get all configuration files from the host server.
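
For question 3, one check that could be run on a compute node is whether a copy of knl_generic.conf has actually arrived in the directories slurmd might read from, and whether it matches the head node's copy. Below is a minimal Python sketch under stated assumptions: the reference file and the candidate directories are passed in by hand, and the script makes no claim about where Slurm really stores configless files.

```python
import hashlib
import sys
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, or None if it is missing."""
    p = Path(path)
    if not p.is_file():
        return None
    return hashlib.sha256(p.read_bytes()).hexdigest()


def compare_copies(reference, candidate_dirs, name="knl_generic.conf"):
    """Map each candidate directory to 'missing', 'match', or 'differs'."""
    ref = file_digest(reference)
    status = {}
    for directory in candidate_dirs:
        digest = file_digest(Path(directory) / name)
        if digest is None:
            status[str(directory)] = "missing"
        elif digest == ref:
            status[str(directory)] = "match"
        else:
            status[str(directory)] = "differs"
    return status


if __name__ == "__main__":
    # Usage (script name is hypothetical):
    #   python check_conf.py <reference file> <dir> [<dir> ...]
    if len(sys.argv) >= 3:
        results = compare_copies(sys.argv[1], sys.argv[2:])
        for directory, state in results.items():
            print(f"{directory}: {state}")
```

A "missing" or "differs" result for every directory would suggest the file never reached the node in the form intended. A "match" would point the investigation back at how slurmd parses or locates the file.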

Many thanks for any help!


Regards / Groete / Sala(ni) Kahle
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Bryan Johnston
Senior HPC Technologist II
Lead: HPC Ecosystems Project
HPCwire 2024's Outstanding Leader in HPC

CHPC | www.chpc.ac.za | NICIS | nicis.ac.za
Centre for High Performance Computing
