[slurm-users] [External] slurmd: error: Node configuration differs from hardware: CPUs=24:48(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)

Michael Robbert mrobbert at mines.edu
Thu Apr 23 21:03:18 UTC 2020


I’m pretty sure that you only need to restart slurmd on the node that was reporting the problem. If that put the node into a drained state, you may need to undrain it manually using scontrol.
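
As a rough sketch of those two steps, assuming the affected node is node003 (the node named later in this thread) and that slurmd is managed by systemd. The `run` helper, the `recover_node` function and the `DRY_RUN` switch are made up for illustration; only `systemctl restart slurmd` and `scontrol update ... State=RESUME` are the real commands.

```shell
#!/bin/sh
# Sketch only: restart slurmd on the affected node, then undrain it.
# NODE, run(), recover_node() and DRY_RUN are illustrative, not from the thread.
NODE="${NODE:-node003}"

run() {
    # Print the command when DRY_RUN=1, otherwise execute it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

recover_node() {
    # 1. On the compute node itself (assumes slurmd runs under systemd).
    run systemctl restart slurmd
    # 2. As a Slurm admin: clear the drain state so jobs can be scheduled again.
    run scontrol update "NodeName=$NODE" State=RESUME
}
```

Running `DRY_RUN=1 recover_node` only prints the two commands. `sinfo -R` shows why a node was drained, which is worth checking before resuming it.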

 

Testing job performance is not the scheduler’s job; it just schedules the jobs that you tell it to. You’ll need to run those tests yourself.
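
One way to run such a test yourself might look like the sketch below: time the same workload once restricted to one thread per core and once with hardware threads in use, via srun's `--hint` option. This assumes hyper-threading is enabled on the node, the node is defined with ThreadsPerCore=2, and a task affinity plugin is configured. The `bench` function, `DRY_RUN`, and the `./my_benchmark` workload are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch only: time one workload with and without hardware threads in use.
# NODE, DRY_RUN, bench() and the workload path are illustrative assumptions.
NODE="${NODE:-node003}"

bench() {
    hint="$1"; cpus="$2"; shift 2
    # Build the full srun command line as the positional parameters.
    set -- srun --nodelist="$NODE" --exclusive --cpus-per-task="$cpus" --hint="$hint" "$@"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
        return 0
    fi
    start=$(date +%s)
    "$@"
    end=$(date +%s)
    echo "hint=$hint cpus=$cpus elapsed=$((end - start))s"
}
```

For a 2-socket, 12-core node, `bench nomultithread 24 ./my_benchmark` followed by `bench multithread 48 ./my_benchmark` would give two wall-clock times to compare. The result depends heavily on the workload, so use a representative job rather than a synthetic one.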

 

Mike

 

From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Robert Kudyba <rkudyba at fordham.edu>
Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
Date: Thursday, April 23, 2020 at 12:55
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] [External] slurmd: error: Node configuration differs from hardware: CPUs=24:48(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)

 


On Thu, Apr 23, 2020 at 1:43 PM Michael Robbert <mrobbert at mines.edu> wrote:

It looks like you have hyper-threading turned on, but haven’t defined ThreadsPerCore=2. You either need to turn off hyper-threading in the BIOS or change the definition of ThreadsPerCore in slurm.conf.
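
The mismatch in the error message is plain arithmetic: slurmd expects CPUs = sockets x cores per socket x threads per core. The sketch below works this through using the 2 sockets from the error message and the 12 cores per socket from the dmidecode output later in this thread. The `node_line` helper is hypothetical, and the lines it prints omit the other parameters (memory, features and so on) a real slurm.conf entry would carry.

```shell
#!/bin/sh
# Sketch only: derive the CPU count slurmd expects from the node topology.
# node_line() is an illustrative helper; topology values come from this thread.
node_line() {
    name="$1"; sockets="$2"; cores="$3"; threads="$4"
    cpus=$((sockets * cores * threads))
    echo "NodeName=$name CPUs=$cpus Boards=1 SocketsPerBoard=$sockets CoresPerSocket=$cores ThreadsPerCore=$threads"
}

node_line node003 2 12 2   # hyper-threading on: 2 * 12 * 2 = 48 CPUs
node_line node001 2 12 1   # hyper-threading off: 2 * 12 * 1 = 24 CPUs
```

Running `slurmd -C` on the node prints the hardware it actually detects in slurm.conf format, which is a quick way to check a definition against reality.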

 

Nice find. node003 has hyper-threading enabled, but node001 and node002 do not:

[root@node001 ~]# dmidecode -t processor | grep -E '(Core Count|Thread Count)'
        Core Count: 12
        Thread Count: 12
        Core Count: 12
        Thread Count: 12

[root@node003 ~]# dmidecode -t processor | grep -E '(Core Count|Thread Count)'
        Core Count: 12
        Thread Count: 24
        Core Count: 12

I found a great mini script to disable hyperthreading without a reboot. I did get the following warning, but I don't think it's a big issue:

 WARNING, didn't collect load info for all cpus, balancing is broken
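
The script referred to above is not included in this message. For illustration only, here is a sketch of one way to switch SMT off at runtime, assuming a kernel that exposes the `/sys/devices/system/cpu/smt/control` interface (upstream since 4.19) and root privileges. The `smt_off` function and the `SMT_DIR` override are hypothetical and are not the poster's script.

```shell
#!/bin/sh
# Sketch only: disable SMT (hyper-threading) at runtime via the kernel's
# sysfs control file. NOT the script mentioned in the thread.
# SMT_DIR is overridable purely so the function can be tried on a scratch dir.
SMT_DIR="${SMT_DIR:-/sys/devices/system/cpu/smt}"

smt_off() {
    if [ ! -w "$SMT_DIR/control" ]; then
        echo "cannot write $SMT_DIR/control (need root and kernel SMT control)" >&2
        return 1
    fi
    # Take the sibling hardware threads offline.
    echo off > "$SMT_DIR/control"
    # Report the resulting state.
    cat "$SMT_DIR/control"
}
```

A change made this way does not survive a reboot, so a BIOS setting or a kernel boot parameter is still needed to make it permanent.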

 

Do I have to restart slurmctld on the head node and/or slurmd on node003?

 

Side question, are there ways with Slurm to test if hyperthreading improves performance and job speed?
