[slurm-users] What is the possible reason for the secondary slurmctld node not allocating jobs after takeover?

Brian Andrus toomuchit at gmail.com
Fri Jun 3 13:15:43 UTC 2022


Offhand, I would suggest double-checking munge and the versions of 
slurmd/slurmctld.
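
For example, something along these lines should confirm whether munge 
credentials decode in both directions and whether the daemon versions 
match (the hostnames are taken from the config quoted below; run each 
command on the node noted in the comments):

    # on slurm1: verify that a credential generated here decodes on slurm2
    munge -n | ssh slurm2 unmunge
    # on slurm2: and in the other direction
    munge -n | ssh slurm1 unmunge

    # on every node: compare controller/daemon/client versions
    slurmctld -V
    slurmd -V
    srun --version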

Brian Andrus

On 6/3/2022 3:17 AM, taleintervenor at sjtu.edu.cn wrote:
>
> Hi, all:
>
> Our cluster is set up with 2 Slurm control nodes, and scontrol show 
> config reports the following:
>
> > scontrol show config
>
>>
> SlurmctldHost[0] = slurm1
>
> SlurmctldHost[1] = slurm2
>
> StateSaveLocation = /etc/slurm/state
>
>>
> Of course, we have made sure that both nodes have the same slurm.conf, 
> mount the same NFS share at StateSaveLocation, and can read/write it. 
> (But their operating systems differ: slurm1 runs CentOS 7 and slurm2 
> runs CentOS 8.)
>
> When slurm1 controls the cluster and slurm2 works in standby mode, the 
> cluster has no problems.
>
> But when we use “scontrol takeover” on slurm2 to switch the primary 
> role, we find that all newly submitted jobs are stuck in the PD state.
>
> No job is allocated resources by slurm2, no matter how long we wait. 
> Meanwhile, old running jobs complete without problems, and query 
> commands like “sinfo” and “sacct” all work well.
>
> The pending reason is first shown as “priority” in squeue, but after 
> we manually update the priority, the reason becomes “none” and the 
> jobs are still stuck in the PD state.
>
> During the period when slurm2 is primary, there are no significant 
> errors in slurmctld.log. Only after we restart the slurm1 service to 
> let slurm2 return to the standby role does it report lots of errors 
> such as:
>
> error: Invalid RPC received MESSAGE_NODE_REGISTRATION_STATUS while in 
> standby mode
>
> error: Invalid RPC received REQUEST_COMPLETE_PROLOG while in standby mode
>
> error: Invalid RPC received REQUEST_COMPLETE_JOB_ALLOCATION while in 
> standby mode
>
> So, is there any suggestion for finding out why slurm2 works 
> abnormally as the primary controller?
>
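
If munge and the versions check out, it may also be worth confirming 
that slurm2 can actually write the shared state directory and what the 
controllers themselves report about their HA status. A rough sketch 
(the /etc/slurm/state path is taken from the quoted config; the log 
path is an assumption for a typical install):

    # on slurm2: confirm the shared StateSaveLocation is mounted and writable
    ls -ld /etc/slurm/state
    touch /etc/slurm/state/ha_write_test && rm /etc/slurm/state/ha_write_test

    # ask the controllers which one is currently serving as primary
    scontrol ping

    # raise the controller log level and watch the log while a test job sits in PD
    scontrol setdebug debug3
    tail -f /var/log/slurmctld.log    # assumed SlurmctldLogFile location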