Have you run lstopo on the Intel and AMD nodes? Run it in both text mode and graphical mode. It might also be worth running lstopo in the job prolog and epilog and checking whether the output changes.

On Thu, Apr 16, 2026, 1:51 PM Pharthiphan Asokan via slurm-users <slurm-users@lists.schedmd.com> wrote:
Hi team,

We’re observing job aborts on Intel-based nodes immediately after a slurmctld reload. AMD nodes remain stable and their jobs continue unaffected. No system or Slurm configuration changes were made before the issue started.

Error observed:
error: Aborting JobID=1288 due to change in socket/core configuration of allocated nodes
Relevant node configuration (Intel node example):
NodeName=smc-h4-u19 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 RealMemory=510976 State=UNKNOWN Feature=model_SSG-1228-B
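One quick check is to compare this configured line against what slurmd itself detects on the node (`slurmd -C` prints the detected hardware in slurm.conf syntax). The sketch below inlines a hypothetical "detected" line for illustration; the SocketsPerBoard=4/CoresPerSocket=32 values are an assumption, not real output from the affected node.

```shell
#!/bin/sh
# Sketch: compare a configured NodeName line against what slurmd detects.
# On the node itself you would instead use: detected=$(slurmd -C | head -1)
# The "detected" line below is illustrative only, not real output.
configured="NodeName=smc-h4-u19 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1"
detected="NodeName=smc-h4-u19 CPUs=128 Boards=1 SocketsPerBoard=4 CoresPerSocket=32 ThreadsPerCore=1"
for key in CPUs Boards SocketsPerBoard CoresPerSocket ThreadsPerCore; do
  c=$(printf '%s\n' "$configured" | tr ' ' '\n' | grep "^$key=")
  d=$(printf '%s\n' "$detected"   | tr ' ' '\n' | grep "^$key=")
  [ "$c" = "$d" ] || echo "MISMATCH: configured $c, detected $d"
done
```

If the loop prints any MISMATCH lines, the node's detected topology no longer matches the configured one, which is exactly the condition that triggers the abort.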
From logs:
error: valid_job_resources: smc-h4-u19 sockets:2:2, cores 64,32
error: Node configuration differs from hardware: CPUs=128:128(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)
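The second error line pairs configured and hardware values as `config:detected(hw)`. Assuming the first line follows the same configured-vs-detected ordering (an assumption about Slurm's log format, not something confirmed here), the numbers can be pulled apart like this:

```shell
# Hypothetical parse of the valid_job_resources line, assuming each pair
# is configured-then-detected (an assumption about the log format).
echo 'error: valid_job_resources: smc-h4-u19 sockets:2:2, cores 64,32' |
awk -F'[ :,]+' '{print "sockets cfg="$5" hw="$6"; cores cfg="$8" hw="$9}'
# → sockets cfg=2 hw=2; cores cfg=64 hw=32
```

Read that way, the socket count agrees but the core count seen for the allocation has halved (64 vs 32), which is consistent with the node's detected topology changing underneath a running job.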
What we’ve verified:
- No changes in BIOS, firmware, or hardware topology
- No edits to slurm.conf or slurm_nodes.conf
- Reload (scontrol reconfigure) triggers job aborts only on Intel nodes
- AMD nodes remain intact through reloads
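Since the reconfigure is the trigger, one way to pin this down is to snapshot the topology text before and after a reload and diff the snapshots. On a node you would capture real snapshots (`lstopo > before.txt`, run `scontrol reconfigure`, `lstopo > after.txt`); the sketch below inlines illustrative sample lines in place of real lstopo output, and the /tmp paths are assumptions.

```shell
#!/bin/sh
# Sketch of a before/after topology comparison. The two here-docs stand
# in for real `lstopo` snapshots taken before and after a reconfigure;
# the sample lines are illustrative, not real lstopo output.
cat > /tmp/topo.before.txt <<'EOF'
Package L#0 + NUMANode L#0
Package L#1 + NUMANode L#1
EOF
cat > /tmp/topo.after.txt <<'EOF'
Package L#0 + NUMANode L#0 + NUMANode L#1
Package L#1 + NUMANode L#2 + NUMANode L#3
EOF
if diff -u /tmp/topo.before.txt /tmp/topo.after.txt > /tmp/topo.diff; then
  echo "topology output unchanged"
else
  echo "topology output changed:"
  cat /tmp/topo.diff
fi
```

If the diff is non-empty only on the Intel nodes, that would point at something (e.g. a BIOS or hwloc-level topology feature) changing how the hardware is enumerated there, rather than at the Slurm configuration files.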
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com