Hi all,
I have a Slurm partition gpu_gmx with the following configuration (Slurm version 20.11.9):
NodeName=node[09-11] Gres=gpu:rtx4080:1 Sockets=1 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62000 State=UNKNOWN
NodeName=node[12-14] Gres=gpu:rtx4070ti:1 Sockets=1 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62000 State=UNKNOWN
PartitionName=gpu_gmx Nodes=node[09-14] Default=NO MaxTime=UNLIMITED State=UP
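For reference, the values that slurmctld actually sees can be cross-checked with scontrol (standard Slurm commands, nothing cluster-specific assumed):

# show the partition as slurmctld sees it
scontrol show partition gpu_gmx
# show state, Gres, and Reason fields for the affected nodes
scontrol show node node[09-14]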
A job running on node11 ran into a problem, after which all nodes in the same partition (node[09-14]) rebooted. From /var/log/slurmctld.log on the controller:
[2023-12-26T23:04:23.200] Batch JobId=25061 missing from batch node node11 (not found BatchStartTime after startup), Requeuing job
[2023-12-26T23:04:23.200] _job_complete: JobId=25061 WTERMSIG 126
[2023-12-26T23:04:23.200] _job_complete: JobId=25061 cancelled by node failure
[2023-12-26T23:04:23.200] _job_complete: requeue JobId=25061 due to node failure
[2023-12-26T23:04:23.200] _job_complete: JobId=25061 done
[2023-12-26T23:04:23.200] validate_node_specs: Node node11 unexpectedly rebooted boot_time=1703603052 last response=1703602983
[2023-12-26T23:04:23.222] validate_node_specs: Node node09 unexpectedly rebooted boot_time=1703603052 last response=1703602983
[2023-12-26T23:04:23.579] Batch JobId=25060 missing from batch node node10 (not found BatchStartTime after startup), Requeuing job
[2023-12-26T23:04:23.579] _job_complete: JobId=25060 WTERMSIG 126
[2023-12-26T23:04:23.579] _job_complete: JobId=25060 cancelled by node failure
[2023-12-26T23:04:23.579] _job_complete: requeue JobId=25060 due to node failure
[2023-12-26T23:04:23.579] _job_complete: JobId=25060 done
[2023-12-26T23:04:23.579] validate_node_specs: Node node10 unexpectedly rebooted boot_time=1703603052 last response=1703602983
[2023-12-26T23:04:23.581] validate_node_specs: Node node14 unexpectedly rebooted boot_time=1703603051 last response=1703602983
[2023-12-26T23:04:23.654] validate_node_specs: Node node13 unexpectedly rebooted boot_time=1703603052 last response=1703602983
[2023-12-26T23:04:24.681] validate_node_specs: Node node12 unexpectedly rebooted boot_time=1703603053 last response=1703602983
[2023-12-27T04:46:42.461] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=25060 uid 0
[2023-12-27T04:46:43.822] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=25061 uid 0
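I am not sure yet whether the reboot was initiated by Slurm itself (e.g. a RebootProgram/ResumeProgram being invoked) or by something outside Slurm. A sketch of the checks I am considering, assuming standard CentOS tooling (ipmitool only applies if the nodes have a BMC, and journalctl only keeps the previous boot if journald storage is persistent):

# was slurmctld configured to reboot/resume nodes on its own?
scontrol show config | grep -iE 'RebootProgram|ResumeProgram|SuspendProgram'
# any reboot-related actions logged by the controller?
grep -i reboot /var/log/slurmctld.log
# reboot history as recorded in wtmp on a compute node
last -x reboot shutdown | head
# hardware/BMC event log, in case of a power event or watchdog reset
ipmitool sel list | tail
# kernel/system messages from the previous boot (needs persistent journald)
journalctl -b -1 -p err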
The operating systems are CentOS 7.9.2009 on the master node and CentOS 8.5.2111 on node[09-14]. Has anyone had a similar experience, and do you have a clue how to resolve this?
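As a side note, the two requeued jobs (25060/25061) had to be killed by hand afterwards (the REQUEST_KILL_JOB lines above). If requeue-on-node-failure turns out to be unwanted here, my understanding is that it can be disabled as sketched below, although that would not explain the reboot itself (job_script.sh is just a placeholder):

# slurm.conf: disable automatic requeue of batch jobs cluster-wide
JobRequeue=0
# or per job at submission time
sbatch --no-requeue job_script.sh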
Thanks in advance.
Best,
Jinglei