Jobs aborting after slurmctld reload on Intel nodes - AMD unaffected
Hi team,

We’re observing job aborts on Intel-based nodes immediately after a slurmctld reload. AMD nodes remain stable and jobs continue unaffected. No system or Slurm configuration changes were made before the issue started.

Error observed:

error: Aborting JobID=1288 due to change in socket/core configuration of allocated nodes

Relevant node configuration (Intel node example):

NodeName=smc-h4-u19 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 RealMemory=510976 State=UNKNOWN Feature=model_SSG-1228-B

From logs:

error: valid_job_resources: smc-h4-u19 sockets:2:2, cores 64,32
error: Node configuration differs from hardware: CPUs=128:128(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)

What we’ve verified:

* No changes in BIOS, firmware, or hardware topology
* No edits to slurm.conf or slurm_nodes.conf
* Reload (scontrol reconfigure) triggers job aborts only on Intel nodes
* AMD nodes remain intact through reloads
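[Editor's note: one quick sanity check on a node definition like the one above is that Slurm expects CPUs to equal Boards x SocketsPerBoard x CoresPerSocket x ThreadsPerCore. A minimal sketch of that arithmetic check (the parsing helpers are hypothetical, written for illustration, not part of Slurm):]

```python
# Sanity-check a slurm.conf NodeName line: CPUs should equal
# Boards * SocketsPerBoard * CoresPerSocket * ThreadsPerCore.
# Hypothetical helpers for illustration only.

def parse_node_line(line):
    """Split a 'Key=Value Key=Value ...' node line into a dict."""
    fields = {}
    for token in line.split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return fields

def topology_consistent(fields):
    """True if the declared CPUs count matches the declared topology."""
    expected = (int(fields.get("Boards", 1))
                * int(fields["SocketsPerBoard"])
                * int(fields["CoresPerSocket"])
                * int(fields.get("ThreadsPerCore", 1)))
    return int(fields["CPUs"]) == expected

line = ("NodeName=smc-h4-u19 CPUs=128 Boards=1 SocketsPerBoard=2 "
        "CoresPerSocket=64 ThreadsPerCore=1 RealMemory=510976")
print(topology_consistent(parse_node_line(line)))  # → True: 1*2*64*1 == 128
```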
On 4/16/26 14:05, Pharthiphan Asokan via slurm-users wrote:
Hi team, We’re observing job aborts on Intel-based nodes immediately after a slurmctld reload. AMD nodes remain stable and jobs continue unaffected. No system or Slurm configuration changes were made before the issue started. Error observed:

error: Aborting JobID=1288 due to change in socket/core configuration of allocated nodes
What's your Slurm version? Please run "slurmd -C" on each type of node, and verify that your slurm.conf NodeName=... lines agree with this output. Any deviation could cause the problem that you're experiencing.

Example output:

$ slurmd -C
NodeName=a045 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=385045

IHTH,
Ole

--
Ole Holm Nielsen, PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark
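[Editor's note: the comparison Ole describes can be scripted. A minimal sketch that diffs the topology fields of a "slurmd -C" line (hardware-detected) against the configured slurm.conf line; the helper names are hypothetical, and the mismatched hardware line below is an invented illustration, not actual output from this cluster:]

```python
# Diff the topology fields of two NodeName lines, e.g. "slurmd -C"
# output (hardware) versus the slurm.conf entry (configured).
# Hypothetical helpers for illustration only.

TOPOLOGY_KEYS = ("CPUs", "Boards", "SocketsPerBoard",
                 "CoresPerSocket", "ThreadsPerCore")

def fields(line):
    """Split a 'Key=Value ...' node line into a dict."""
    return dict(t.split("=", 1) for t in line.split() if "=" in t)

def topology_diff(configured, hardware):
    """Return {key: (configured_value, hardware_value)} for mismatches."""
    conf, hw = fields(configured), fields(hardware)
    return {k: (conf.get(k), hw.get(k))
            for k in TOPOLOGY_KEYS if conf.get(k) != hw.get(k)}

conf_line = ("NodeName=smc-h4-u19 CPUs=128 Boards=1 SocketsPerBoard=2 "
             "CoresPerSocket=64 ThreadsPerCore=1")
hw_line = ("NodeName=smc-h4-u19 CPUs=128 Boards=1 SocketsPerBoard=4 "
           "CoresPerSocket=32 ThreadsPerCore=1")
print(topology_diff(conf_line, hw_line))
# → {'SocketsPerBoard': ('2', '4'), 'CoresPerSocket': ('64', '32')}
```

An empty dict means the configured and detected topologies agree for that node.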
Have you run lstopo on the Intel and AMD nodes? Run it in both text mode and graphical mode. It might also be worth running lstopo in the job prolog and epilog and checking whether the output changes.

On Thu, Apr 16, 2026, 1:51 PM Pharthiphan Asokan via slurm-users <slurm-users@lists.schedmd.com> wrote:
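[Editor's note: the prolog/epilog idea suggested above could be sketched as follows. A minimal example, assuming hwloc's lstopo is installed on the compute node; the helper name and log path are hypothetical:]

```python
# Sketch: capture lstopo output from a job prolog/epilog so the
# topology seen before and after a job (or a slurmctld reload) can be
# diffed. Helper name and log path are hypothetical.
import shutil
import subprocess

def capture_topology(tag, logpath="/tmp/lstopo.log"):
    """Append lstopo text output to a log, labeled with `tag`.

    Returns the captured text, or None if lstopo is not installed.
    """
    if shutil.which("lstopo") is None:
        return None  # hwloc not installed on this host
    out = subprocess.run(["lstopo", "--of", "console"],
                         capture_output=True, text=True).stdout
    with open(logpath, "a") as f:
        f.write(f"--- {tag} ---\n{out}")
    return out

result = capture_topology("prolog")
```

Diffing consecutive entries in the log would show whether the kernel-reported topology really changes across a reload.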
-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-leave@lists.schedmd.com
participants (3)
- John Hearns
- Ole Holm Nielsen
- Pharthiphan Asokan