[slurm-users] 4 sockets but "
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Wed Jul 21 18:27:06 UTC 2021
On 21-07-2021 11:56, Diego Zuccato wrote:
> I suspended testing config changes to update another machine. In the
> last test I added "CPUs=192" to the node definition, restarted slurmctld
> and nothing changed.
> When I returned, I checked again and slurm reported 192 CPUs! Magic?
> I now removed CPUs=192, restarted slurmctld and it keeps seeing all CPUs...
> What should I think?
Did you distribute the new slurm.conf to all compute nodes after the
change? Did you run "scontrol reconfig" so that the slurmd daemons pick
up the changes? This is standard procedure when making any changes to
slurm.conf; read about "reconfigure" in the scontrol man-page.
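As a rough sketch of that procedure, assuming the file lives at /etc/slurm/slurm.conf and is pushed with scp: the node names, the path, and the RUN dry-run switch below are illustrative placeholders, not anything from this thread. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: push an edited slurm.conf to the compute nodes, then ask the
# Slurm daemons to re-read it. NODES and CONF are hypothetical placeholders.
NODES="${NODES:-node01 node02}"
CONF="${CONF:-/etc/slurm/slurm.conf}"
# Dry run by default: commands are echoed. Set RUN= (empty) to execute them.
RUN="${RUN-echo}"

distribute_conf() {
    # Copy the controller's slurm.conf to the same path on every node.
    for node in $NODES; do
        $RUN scp "$CONF" "$node:$CONF"
    done
    # Tell slurmctld and the slurmd daemons to re-read the configuration.
    $RUN scontrol reconfigure
}

distribute_conf
```

Sites that already use a parallel shell or a configuration management tool would replace the scp loop with that; the point is only that the copy happens before the reconfigure.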
Configless Slurm (https://slurm.schedmd.com/configless_slurm.html),
available since 20.02, makes distribution of slurm.conf really simple.
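A minimal configless setup might look roughly like the fragment below. The hostname "ctld-host" is a hypothetical placeholder, and 6817 is the default slurmctld port; check the page above for the details that apply to your version.

```shell
# In slurm.conf on the slurmctld host (restart slurmctld afterwards):
#   SlurmctldParameters=enable_configless

# On each compute node, start slurmd so that it fetches the configuration
# from the controller instead of reading a local slurm.conf.
# "ctld-host" is a placeholder for your controller's hostname.
slurmd --conf-server ctld-host:6817
```

With this in place only the controller's copy of slurm.conf needs editing; the compute nodes no longer keep a local copy to go stale.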
> But another problem surfaces: slurmtop seems not to handle so many CPUs
> gracefully and throws a lot of errors, but that should be something
For monitoring the state of compute nodes and their jobs, I recommend
pestat. I use "pestat -F" many times every day to see if any jobs are
misbehaving.