Hi team,
We’re observing job aborts on Intel-based nodes immediately after a slurmctld reload. AMD nodes remain stable and jobs continue unaffected. No system or Slurm configuration changes were made before the issue started.
Error observed:
error: Aborting JobID=1288 due to change in socket/core configuration of allocated nodes
Relevant node configuration (Intel node example):
NodeName=smc-h4-u19 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 RealMemory=510976 State=UNKNOWN Feature=model_SSG-1228-B
From logs:
error: valid_job_resources: smc-h4-u19 sockets:2:2, cores 64,32
error: Node configuration differs from hardware: CPUs=128:128(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)
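The log suggests slurmctld's recorded topology and the detected hardware disagree on cores per socket (64 vs 32), which is what would happen if hyperthreading state changed on the Intel nodes. As a quick sanity check, a small sketch like the one below can diff the key=value fields of the slurm.conf NodeName line against the output of `slurmd -C` on the node. The `hw` string here is hypothetical example data (CoresPerSocket=32, ThreadsPerCore=2), not our actual node output:

```python
def parse_node_line(line: str) -> dict:
    """Parse key=value tokens from a slurm.conf NodeName line or `slurmd -C` output."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)

# Configured topology from slurm.conf (taken from the node definition above)
conf = parse_node_line(
    "NodeName=smc-h4-u19 CPUs=128 Boards=1 SocketsPerBoard=2 "
    "CoresPerSocket=64 ThreadsPerCore=1 RealMemory=510976"
)

# Hypothetical `slurmd -C` output illustrating a hyperthreading flip:
# same CPU count (2 * 32 * 2 = 128), but different core/thread split
hw = parse_node_line(
    "NodeName=smc-h4-u19 CPUs=128 Boards=1 SocketsPerBoard=2 "
    "CoresPerSocket=32 ThreadsPerCore=2 RealMemory=510976"
)

# Report only the fields that diverge
diff = {k: (conf[k], hw[k]) for k in conf if k in hw and conf[k] != hw[k]}
print(diff)
```

With the example data this prints `{'CoresPerSocket': ('64', '32'), 'ThreadsPerCore': ('1', '2')}`; CPUs, Boards, and SocketsPerBoard match, which is consistent with the log only flagging the socket/core split.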
What we’ve verified: