Jobs aborting after slurmctld reload on Intel nodes - AMD unaffected
Hi team,

We’re observing job aborts on Intel-based nodes immediately after a slurmctld reload. AMD nodes remain stable and jobs continue unaffected. No system or Slurm configuration changes were made before the issue started.

Error observed:

error: Aborting JobID=1288 due to change in socket/core configuration of allocated nodes

Relevant node configuration (Intel node example):

NodeName=smc-h4-u19 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 RealMemory=510976 State=UNKNOWN Feature=model_SSG-1228-B

From logs:

error: valid_job_resources: smc-h4-u19 sockets:2:2, cores 64,32
error: Node configuration differs from hardware: CPUs=128:128(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)

What we’ve verified:

* No changes in BIOS, firmware, or hardware topology
* No edits to slurm.conf or slurm_nodes.conf
* Reload (scontrol reconfigure) triggers job aborts only on Intel nodes
* AMD nodes remain intact through reloads
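[Editor's note: one quick sanity check on a node definition like the one above is that Slurm expects CPUs to equal Boards x SocketsPerBoard x CoresPerSocket x ThreadsPerCore. A minimal sketch of that arithmetic check (the parsing helpers are hypothetical, written for illustration, not part of Slurm):]

```python
# Sanity-check a slurm.conf NodeName line: CPUs should equal
# Boards * SocketsPerBoard * CoresPerSocket * ThreadsPerCore.
# Hypothetical helpers for illustration only.

def parse_node_line(line):
    """Split a 'Key=Value Key=Value ...' node line into a dict."""
    fields = {}
    for token in line.split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return fields

def topology_consistent(fields):
    """True if the declared CPUs count matches the declared topology."""
    expected = (int(fields.get("Boards", 1))
                * int(fields["SocketsPerBoard"])
                * int(fields["CoresPerSocket"])
                * int(fields.get("ThreadsPerCore", 1)))
    return int(fields["CPUs"]) == expected

line = ("NodeName=smc-h4-u19 CPUs=128 Boards=1 SocketsPerBoard=2 "
        "CoresPerSocket=64 ThreadsPerCore=1 RealMemory=510976")
print(topology_consistent(parse_node_line(line)))  # → True: 1*2*64*1 == 128
```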
On 4/16/26 14:05, Pharthiphan Asokan via slurm-users wrote:
Hi team, We’re observing job aborts on Intel-based nodes immediately after a slurmctld reload. AMD nodes remain stable and jobs continue unaffected. No system or Slurm configuration changes were made before the issue started. Error observed:

error: Aborting JobID=1288 due to change in socket/core configuration of allocated nodes
What's your Slurm version? Please run "slurmd -C" on each type of node, and verify that your slurm.conf NodeName=... lines agree with this output. Any deviation could cause the problem that you're experiencing.

Example output:

$ slurmd -C
NodeName=a045 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=385045

IHTH,
Ole

--
Ole Holm Nielsen, PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark
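[Editor's note: the comparison Ole describes can be scripted. A minimal sketch that diffs the topology fields of a "slurmd -C" line (hardware-detected) against the configured slurm.conf line; the helper names are hypothetical, and the mismatched hardware line below is an invented illustration, not actual output from this cluster:]

```python
# Diff the topology fields of two NodeName lines, e.g. "slurmd -C"
# output (hardware) versus the slurm.conf entry (configured).
# Hypothetical helpers for illustration only.

TOPOLOGY_KEYS = ("CPUs", "Boards", "SocketsPerBoard",
                 "CoresPerSocket", "ThreadsPerCore")

def fields(line):
    """Split a 'Key=Value ...' node line into a dict."""
    return dict(t.split("=", 1) for t in line.split() if "=" in t)

def topology_diff(configured, hardware):
    """Return {key: (configured_value, hardware_value)} for mismatches."""
    conf, hw = fields(configured), fields(hardware)
    return {k: (conf.get(k), hw.get(k))
            for k in TOPOLOGY_KEYS if conf.get(k) != hw.get(k)}

conf_line = ("NodeName=smc-h4-u19 CPUs=128 Boards=1 SocketsPerBoard=2 "
             "CoresPerSocket=64 ThreadsPerCore=1")
hw_line = ("NodeName=smc-h4-u19 CPUs=128 Boards=1 SocketsPerBoard=4 "
           "CoresPerSocket=32 ThreadsPerCore=1")
print(topology_diff(conf_line, hw_line))
# → {'SocketsPerBoard': ('2', '4'), 'CoresPerSocket': ('64', '32')}
```

An empty dict means the configured and detected topologies agree for that node.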
Have you run lstopo on the Intel and AMD nodes? Run it in both text mode and graphical mode. It might also be worth running lstopo in the job prolog and epilog and checking whether the output changes.

On Thu, Apr 16, 2026, 1:51 PM Pharthiphan Asokan via slurm-users <slurm-users@lists.schedmd.com> wrote:
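[Editor's note: the prolog/epilog idea suggested above could be sketched as follows. A minimal example, assuming hwloc's lstopo is installed on the compute node; the helper name and log path are hypothetical:]

```python
# Sketch: capture lstopo output from a job prolog/epilog so the
# topology seen before and after a job (or a slurmctld reload) can be
# diffed. Helper name and log path are hypothetical.
import shutil
import subprocess

def capture_topology(tag, logpath="/tmp/lstopo.log"):
    """Append lstopo text output to a log, labeled with `tag`.

    Returns the captured text, or None if lstopo is not installed.
    """
    if shutil.which("lstopo") is None:
        return None  # hwloc not installed on this host
    out = subprocess.run(["lstopo", "--of", "console"],
                         capture_output=True, text=True).stdout
    with open(logpath, "a") as f:
        f.write(f"--- {tag} ---\n{out}")
    return out

result = capture_topology("prolog")
```

Diffing consecutive entries in the log would show whether the kernel-reported topology really changes across a reload.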
-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-leave@lists.schedmd.com
participants (3)
- John Hearns
- Ole Holm Nielsen
- Pharthiphan Asokan