[slurm-users] All nodes within one partition reboot unexpectedly

Jinglei Hu hujinglei at ymail.com
Tue Jan 2 08:25:51 UTC 2024


Hi all,

I have a Slurm partition, gpu_gmx, with the following configuration (Slurm version 20.11.9):

> NodeName=node[09-11] Gres=gpu:rtx4080:1 Sockets=1 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62000 State=UNKNOWN
> NodeName=node[12-14] Gres=gpu:rtx4070ti:1 Sockets=1 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62000 State=UNKNOWN
> PartitionName=gpu_gmx Nodes=node[09-14] Default=NO MaxTime=UNLIMITED State=UP
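
As a sanity check (not part of my configuration above, just a generic query), the reboot- and power-saving-related controller settings can be listed with scontrol; the parameter names below are all standard slurm.conf options:

  # List controller settings that could cause nodes to be rebooted or powered down
  scontrol show config | grep -Ei 'RebootProgram|ResumeProgram|SuspendProgram|SuspendTime|ReturnToService'

(grep -Ei just keeps the match case-insensitive, regardless of how scontrol prints the keys.)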

A job running on node11 ran into a problem, which then apparently triggered a reboot of all nodes (node[09-14]) in the same partition. Here is the relevant excerpt from /var/log/slurmctld.log:

> [2023-12-26T23:04:23.200] Batch JobId=25061 missing from batch node node11 (not found BatchStartTime after startup), Requeuing job
> [2023-12-26T23:04:23.200] _job_complete: JobId=25061 WTERMSIG 126
> [2023-12-26T23:04:23.200] _job_complete: JobId=25061 cancelled by node failure
> [2023-12-26T23:04:23.200] _job_complete: requeue JobId=25061 due to node failure
> [2023-12-26T23:04:23.200] _job_complete: JobId=25061 done
> [2023-12-26T23:04:23.200] validate_node_specs: Node node11 unexpectedly rebooted boot_time=1703603052 last response=1703602983
> [2023-12-26T23:04:23.222] validate_node_specs: Node node09 unexpectedly rebooted boot_time=1703603052 last response=1703602983
> [2023-12-26T23:04:23.579] Batch JobId=25060 missing from batch node node10 (not found BatchStartTime after startup), Requeuing job
> [2023-12-26T23:04:23.579] _job_complete: JobId=25060 WTERMSIG 126
> [2023-12-26T23:04:23.579] _job_complete: JobId=25060 cancelled by node failure
> [2023-12-26T23:04:23.579] _job_complete: requeue JobId=25060 due to node failure
> [2023-12-26T23:04:23.579] _job_complete: JobId=25060 done
> [2023-12-26T23:04:23.579] validate_node_specs: Node node10 unexpectedly rebooted boot_time=1703603052 last response=1703602983
> [2023-12-26T23:04:23.581] validate_node_specs: Node node14 unexpectedly rebooted boot_time=1703603051 last response=1703602983
> [2023-12-26T23:04:23.654] validate_node_specs: Node node13 unexpectedly rebooted boot_time=1703603052 last response=1703602983
> [2023-12-26T23:04:24.681] validate_node_specs: Node node12 unexpectedly rebooted boot_time=1703603053 last response=1703602983
> [2023-12-27T04:46:42.461] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=25060 uid 0
> [2023-12-27T04:46:43.822] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=25061 uid 0

The operating system is CentOS 7.9.2009 on the master node and CentOS 8.5.2111 on node[09-14]. Has anyone seen something similar, and do you have any idea how to resolve this?
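
In case it helps to narrow things down, these are the node-side checks I can think of (a rough sketch only; it assumes wtmp history and a persistent systemd journal, and the slurmd log path depends on SlurmdLogFile in slurm.conf):

  # Recent reboot/shutdown records from wtmp (run on each of node[09-14])
  last -x reboot shutdown | head -n 20

  # Tail of the previous boot's journal, warnings and above (needs persistent journald storage)
  journalctl -b -1 -p warning --no-pager | tail -n 50

  # Anything slurmd itself logged around the reboot; adjust the path to your SlurmdLogFile
  grep -iE 'reboot|shutdown' /var/log/slurmd.log | tail -n 20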

Thanks in advance.

Best,
Jinglei

