[slurm-users] 4 sockets but "

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Wed Jul 21 18:27:06 UTC 2021


Hi Diego,

On 21-07-2021 11:56, Diego Zuccato wrote:
> I suspended testing config changes to update another machine. In the 
> last test I added "CPUs=192" to the node definition, restarted slurmctld 
> and nothing changed.
> When I returned, I checked again and slurm reported 192 CPUs! Magic?
> I now removed CPUs=192, restarted slurmctld and it keeps seeing all CPUs...
> What should I think?

Did you distribute the new slurm.conf to all compute nodes after the 
change?  Did you run "scontrol reconfig" so that the slurmd daemons pick 
up the changes?  This is standard procedure after any change to 
slurm.conf; read about "reconfigure" in the scontrol man-page.
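
As a rough sketch of that procedure (not a definitive recipe): the helper 
below copies slurm.conf to each node and then runs "scontrol reconfigure". 
The function names, the /etc/slurm/slurm.conf path, root ssh access and the 
node names are all assumptions for illustration. With DRY_RUN=1 (the 
default here) it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: CONF, the root@ login and the node names are assumptions.
CONF="${CONF:-/etc/slurm/slurm.conf}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy slurm.conf to every node given as an argument, then ask
# slurmctld and the slurmd daemons to re-read their configuration.
push_conf() {
    for node in "$@"; do
        run scp "$CONF" "root@${node}:${CONF}"
    done
    run scontrol reconfigure
}
```

For example, "push_conf node01 node02" prints the two scp commands followed 
by "scontrol reconfigure"; set DRY_RUN=0 to actually execute them. 
Afterwards "scontrol show node node01" shows what slurmctld believes about 
the node, and "slurmd -C" on the node prints the hardware slurmd detects.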

Configless Slurm (https://slurm.schedmd.com/configless_slurm.html), 
available since 20.02, makes distribution of slurm.conf really simple.
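
A minimal sketch of what enabling it involves, going by the page linked 
above; treat it as a starting point rather than a complete setup. On the 
slurmctld host, slurm.conf gets one extra parameter:

```
# slurm.conf on the slurmctld host
SlurmctldParameters=enable_configless
```

Each slurmd is then started with the "--conf-server" option pointing at the 
controller (host name and port depend on your site; 6817 is the default 
slurmctld port), and it fetches slurm.conf from there instead of reading a 
local copy.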

> But another problem surfaces: slurmtop seems not to handle so many CPUs 
> gracefully and throws a lot of errors, but that should be something 
> manageable...

For monitoring the state of compute nodes and their jobs, I recommend 
"pestat" from 
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat

I use "pestat -F" many times every day to see if any jobs are misbehaving.

/Ole


