[slurm-users] what is the possible reason for secondary slurmctld node not allocate job after takeover?
Brian Andrus
toomuchit at gmail.com
Fri Jun 3 13:15:43 UTC 2022
Offhand, I would suggest double-checking munge and the slurmd/slurmctld
versions on both controllers.
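For example, something along these lines can confirm both (a rough sketch;
the host name comes from your config below, and default paths are assumed):

  # generate a munge credential on slurm1 and decode it on slurm2
  munge -n | ssh slurm2 unmunge
  # compare controller/daemon versions on both hosts
  slurmctld -V; ssh slurm2 slurmctld -V
  slurmd -V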
Brian Andrus
On 6/3/2022 3:17 AM, taleintervenor at sjtu.edu.cn wrote:
>
> Hi, all:
>
> Our cluster is set up with 2 Slurm control nodes, and “scontrol show config” reports the following:
>
> > scontrol show config
>
> …
>
> SlurmctldHost[0] = slurm1
>
> SlurmctldHost[1] = slurm2
>
> StateSaveLocation = /etc/slurm/state
>
> …
>
> Of course we have made sure that both nodes have the same slurm.conf and
> mount the same NFS share at StateSaveLocation with read/write access.
> (However, their operating systems differ: slurm1 runs CentOS 7 and slurm2 runs CentOS 8.)
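>
> As a quick check from each controller (a minimal sketch, using the
> StateSaveLocation path from the config above), we verified the shared
> state directory is writable:
>
> touch /etc/slurm/state/.rw_test && rm /etc/slurm/state/.rw_test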
>
> When slurm1 controls the cluster and slurm2 works in standby mode, the
> cluster has no problems.
>
> But when we use “scontrol takeover” on slurm2 to switch the primary
> role, we find that newly submitted jobs all get stuck in the PD state.
>
> No jobs are allocated resources by slurm2, no matter how long we
> wait. Meanwhile, previously running jobs complete without problems, and
> query commands like “sinfo” and “sacct” all work well.
>
> The pending reason is initially shown as “Priority” in squeue, but after
> we manually update the priority, the reason becomes “None” and the jobs
> remain stuck in the PD state.
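>
> (The manual update was done with a command along the lines of
> “scontrol update JobId=<jobid> Priority=<value>”.)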
>
> While slurm2 is the primary, there are no significant errors in
> slurmctld.log. Only after we restart the slurm1 service, so that slurm2
> returns to the standby role, does it report many errors such as:
>
> error: Invalid RPC received MESSAGE_NODE_REGISTRATION_STATUS while in
> standby mode
>
> error: Invalid RPC received REQUEST_COMPLETE_PROLOG while in standby mode
>
> error: Invalid RPC received REQUEST_COMPLETE_JOB_ALLOCATION while in
> standby mode
>
> So, are there any suggestions for finding the reason why slurm2 behaves
> abnormally as the primary controller?
>