[slurm-users] Update: Rebooted Nodes & Jobs Stuck in Cleaning State

Roberts, John E. jeroberts at anl.gov
Mon Oct 15 13:01:20 MDT 2018


If anyone saw my first post below, here's an update. I finally worked around this by removing the execute bit from the health check (nhc) at boot. A pre-start check on the slurmd service makes it wait until all of the GPFS mounts are fully present, and only then re-adds the execute bit to nhc. It adds about 20 seconds to the boot process, but it works.
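
For anyone who wants to do something similar, it looks roughly like the snippets below. This is only a sketch: the drop-in path, script path, nhc path, and mount list are placeholders, not our exact setup.

    # /etc/systemd/system/slurmd.service.d/wait-gpfs.conf (drop-in; path illustrative)
    [Service]
    ExecStartPre=/usr/local/sbin/wait_for_gpfs.sh

    #!/bin/bash
    # /usr/local/sbin/wait_for_gpfs.sh (illustrative)
    # Hold up slurmd startup until the GPFS mounts are present, then
    # restore the execute bit on nhc so the health check can run again.
    MOUNTS="/blues/gpfs/proj/0 /blues/gpfs/group/2"   # example mount list
    NHC=/usr/sbin/nhc                                 # example nhc path
    for i in $(seq 1 60); do
        ok=1
        for m in $MOUNTS; do
            mountpoint -q "$m" || ok=0
        done
        [ "$ok" -eq 1 ] && break
        sleep 2
    done
    [ "$ok" -eq 1 ] || exit 1    # mounts still missing; keep slurmd from starting
    chmod +x "$NHC"              # nhc was made non-executable earlier in boot
    exit 0

Since the check runs as an ExecStartPre, slurmd itself doesn't start until it passes.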

The root of the issue is that nhc runs at boot before GPFS has finished mounting everything and drains the node (expected, since GPFS is a little slow to get everything in place). The drain requeues the job (also expected, apparently, since Slurm v17.11.4). When our jobs are requeued this way, they get stuck in the PD (Cleaning) state and can't be cleared without killing the job. If I could figure out why a job can't properly requeue itself, I could remove this small startup hack.
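
For context, the mounts are checked in nhc with lines along these lines; this is an illustrative nhc.conf fragment, not a copy of ours:

    # nhc.conf fragment (illustrative)
    * || check_fs_mount -f /blues/gpfs/proj/0
    * || check_fs_mount -f /blues/gpfs/group/2

With GPFS still settling at boot, those checks fail, the node gets drained, and the job gets requeued.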

Thanks!
John 

On 10/10/18, 4:08 PM, "Roberts, John E." <jeroberts at anl.gov> wrote:

    Hi,
    
    Hopefully this isn't an obvious fix I'm missing. We have a large number of KNL nodes that can get rebooted when users change their memory or cluster modes. I never heard any complaints when running Slurm v16.05.10, but I've seen a number of issues since our upgrade to v17.11.7 a couple of months ago. Even when changing the mode of a single KNL node that requires a reboot, the job will almost surely go from the CF state to a PD state with reason (Cleaning). None of our configuration changed, so I'm not sure where this is coming from.
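
    These reboots are triggered by ordinary submissions that request a different NUMA/MCDRAM mode through node features, e.g. something like the following (the exact constraint names depend on how the knl_generic plugin is configured):

        # asks for flat/quad mode; the node reboots if it's currently in cache mode
        sbatch -N 1 --constraint=flat,quad myjob.sh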
    
    In the logs below, the order of operations seems to be the following. Slurm requests the node and hands it the state changes. The node reboots and switches into the correct mode. The node gets put into a "failed state", likely because of our nhc health checks? The node eventually reaches an idle state once it's fully up, but the job remains in Cleaning. I can't even forcefully resume the job; it has to be killed. I can then delete the job, resubmit immediately, and the new job will run. So why is Slurm having trouble getting from the PD (Cleaning) state to Running? Again, this wasn't previously an issue.
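
    Concretely, once the job is stuck this is roughly what I've been trying (job ID taken from the logs below; exact commands from memory):

        scontrol show job 680790     # shows JobState=PENDING Reason=Cleaning
        scontrol release 680790      # doesn't clear it
        scontrol requeue 680790      # doesn't clear it either
        scancel 680790               # killing the job is the only thing that works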
    
    Here is what I described above:
    
    The node is allocated:
    [2018-10-10T07:04:05.759] sched: Allocate JobID=680790 NodeList=knl-0008 #CPUs=64 Partition=knlall
    
    The node reboots into the new configuration. It then fails, presumably because of the nhc health check? That failure is expected, since GPFS mounts can take some time to appear, especially if a large number of nodes were just reconfigured. The health check interval is set to 30 seconds, fyi (a rough sketch of the slurm.conf wiring follows these log lines):
    [2018-10-10T07:11:13.590] update_node: node knl-0008 reason set to: NHC: check_fs_mount:  /blues/gpfs/proj/0 not mounted; directory /blues/gpfs/proj/0 missing (auto-fixed)
    [2018-10-10T07:11:13.590] update_node: DRAIN/FAIL request for node knl-0008 which is allocated and being powered up. Requeueing jobs
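
    The health check is wired up in slurm.conf roughly like this; only the 30-second interval is our real value, the program path and node-state setting here are illustrative:

        HealthCheckProgram=/usr/sbin/nhc
        HealthCheckInterval=30
        HealthCheckNodeState=ANY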
    
    The job is requeued here; this is when it gets set to Cleaning and it never leaves that state:
    [2018-10-10T07:11:13.590] requeue job 680790 due to failure of node knl-0008
    [2018-10-10T07:11:13.590] Requeuing JobID=680790 State=0x0 NodeCnt=0
    [2018-10-10T07:11:13.590] update_node: node knl-0008 state set to DRAINED*
    [2018-10-10T07:11:14.274] Node knl-0008 rebooted 85 secs ago
    [2018-10-10T07:11:14.275] _update_node_avail_features: nodes knl-0008 available features set to: knl,cache,hybrid,flat,auto,a2a,snc2,snc4,hemi,quad
    [2018-10-10T07:11:14.275] _update_node_active_features: nodes knl-0008 active features set to: knl,cache,quad
    
    Node starts getting back to normal:
    [2018-10-10T07:11:14.275] Node knl-0008 now responding
    [2018-10-10T07:11:15.768] Job 680790 no longer waiting for node boot
    [2018-10-10T07:11:29.842] update_node: node knl-0008 reason set to: NHC: check_fs_mount:  /blues/gpfs/proj/0 not mounted
    [2018-10-10T07:11:29.842] update_node: node knl-0008 state set to DRAINED
    [2018-10-10T07:12:10.033] error: Nodes knl-0008 not responding
    [2018-10-10T07:12:29.137] update_node: node knl-0008 reason set to: NHC: check_fs_mount:  /blues/gpfs/group/2 not mounted
    [2018-10-10T07:12:29.137] update_node: node knl-0008 state set to DRAINED
    
    The node finally becomes idle now that it has passed all of its checks:
    [2018-10-10T07:13:19.622] update_node: node knl-0008 state set to IDLE
    
    I have to kill and resubmit the job because it's stuck in Cleaning:
    [2018-10-10T07:14:05.702] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 680790 uid 4688
    [2018-10-10T07:14:05.703] _job_signal: of pending JobID=680790 State=0x4 NodeCnt=0 successful
    
    New job runs successfully:
    [2018-10-10T07:14:10.259] sched: Allocate JobID=680795 NodeList=knl-0008 #CPUs=64 Partition=knlall
    
    I appreciate any feedback. 
    
    Thanks!
    --
    John Roberts
    HPC Systems Administrator
    
    


