[slurm-users] Rebooted Nodes & Jobs Stuck in Cleaning State

Roberts, John E. jeroberts at anl.gov
Wed Oct 10 15:08:45 MDT 2018


Hi,

Hopefully this isn't an obvious fix I'm missing. We have a large number of KNL nodes that can get rebooted when users change their memory or cluster modes. I never heard any complaints when running Slurm v16.05.10, but I've seen a number of issues since our upgrade a couple of months ago to v17.11.7. Even when changing the mode of a single KNL node, which requires a reboot, the job will almost always go from the CF (configuring) state to a PD (pending) state with reason (Cleaning). None of our configuration has changed, so I'm not sure where this is coming from.
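
For context, a user triggers the reboot simply by requesting a different memory/cluster mode through node features, along these lines (the script name is illustrative):

sbatch -N 1 -p knlall -C knl,cache,quad ./myjob.sh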

In the logs below, the order of operations seems to be as follows. Slurm allocates the node and passes it the requested mode changes. The node reboots and comes up in the correct mode. The node then gets put into a failed/drained state, likely because of our NHC health checks? The node eventually reaches the idle state once it is fully up, but the job remains in Cleaning. I can't even forcefully resume the job; it has to be killed. I can then delete the job, resubmit it immediately, and it runs fine. So why is Slurm having trouble getting the job from the PD/Cleaning state to Running? Again, this wasn't an issue before the upgrade.
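
For reference, this is roughly what I check and try on a stuck job before giving up (standard commands, output abbreviated; using the job from the logs below as an example):

squeue -j 680790 -o "%i %T %r"    # reports PENDING with reason Cleaning
scontrol requeue 680790           # doesn't help, the job stays in Cleaning
scontrol release 680790           # doesn't help either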

Here is what I described above:

The node is allocated:
[2018-10-10T07:04:05.759] sched: Allocate JobID=680790 NodeList=knl-0008 #CPUs=64 Partition=knlall

The node reboots into the new configuration. It then fails, presumably because of the NHC health check failure? This is an expected failure, since GPFS mounts can take some time to come up, especially if a large number of nodes were just reconfigured. The health check interval is set to 30 seconds, FYI (relevant slurm.conf lines follow the log excerpt):
[2018-10-10T07:11:13.590] update_node: node knl-0008 reason set to: NHC: check_fs_mount:  /blues/gpfs/proj/0 not mounted; directory /blues/gpfs/proj/0 missing (auto-fixed)
[2018-10-10T07:11:13.590] update_node: DRAIN/FAIL request for node knl-0008 which is allocated and being powered up. Requeueing jobs
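
For completeness, the health-check configuration is along these lines (the program path is illustrative; the 30-second interval is the one mentioned above):

HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=30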

The job is requeued here; this is when it gets set to Cleaning, and it never leaves that state:
[2018-10-10T07:11:13.590] requeue job 680790 due to failure of node knl-0008
[2018-10-10T07:11:13.590] Requeuing JobID=680790 State=0x0 NodeCnt=0
[2018-10-10T07:11:13.590] update_node: node knl-0008 state set to DRAINED*
[2018-10-10T07:11:14.274] Node knl-0008 rebooted 85 secs ago
[2018-10-10T07:11:14.275] _update_node_avail_features: nodes knl-0008 available features set to: knl,cache,hybrid,flat,auto,a2a,snc2,snc4,hemi,quad
[2018-10-10T07:11:14.275] _update_node_active_features: nodes knl-0008 active features set to: knl,cache,quad
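
(As far as the node itself is concerned, the mode change took. Something like

scontrol show node knl-0008 | grep -i features

shows ActiveFeatures=knl,cache,quad, matching what was requested.)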

Node starts getting back to normal:
[2018-10-10T07:11:14.275] Node knl-0008 now responding
[2018-10-10T07:11:15.768] Job 680790 no longer waiting for node boot
[2018-10-10T07:11:29.842] update_node: node knl-0008 reason set to: NHC: check_fs_mount:  /blues/gpfs/proj/0 not mounted
[2018-10-10T07:11:29.842] update_node: node knl-0008 state set to DRAINED
[2018-10-10T07:12:10.033] error: Nodes knl-0008 not responding
[2018-10-10T07:12:29.137] update_node: node knl-0008 reason set to: NHC: check_fs_mount:  /blues/gpfs/group/2 not mounted
[2018-10-10T07:12:29.137] update_node: node knl-0008 state set to DRAINED

The node finally becomes idle now that it has passed all of its checks:
[2018-10-10T07:13:19.622] update_node: node knl-0008 state set to IDLE
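
(A quick check like

sinfo -n knl-0008 -o "%N %T %E"

confirms the node is idle with no drain reason at this point, so from the node's side everything looks recovered; only the job is still stuck.)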

I have to kill and resubmit the job because it's stuck in Cleaning:
[2018-10-10T07:14:05.702] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 680790 uid 4688
[2018-10-10T07:14:05.703] _job_signal: of pending JobID=680790 State=0x4 NodeCnt=0 successful

New job runs successfully:
[2018-10-10T07:14:10.259] sched: Allocate JobID=680795 NodeList=knl-0008 #CPUs=64 Partition=knlall
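
In other words, the workaround boils down to:

scancel <jobid>
sbatch <same job script>

which works, but isn't something I want users to have to do after every mode change.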

I appreciate any feedback. 

Thanks!
--
John Roberts
HPC Systems Administrator


