[slurm-users] Re: controller backup slurmctld error while takeover

25 Mar 2024


I would hazard a guess that DNS is not working fully from or for 
the nodes themselves.
Validate that you can ping the nodes by name from the backup controller. 
Also verify that their hostnames match what DNS says they are. And validate 
that you can ping the backup controller from the nodes by the name it has 
in the slurm.conf file.
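As a rough sketch of the name checks described above, the following Python helper reports which names resolve from the host it runs on. The function name `check_names` and the injectable `resolve` parameter are assumptions made for illustration, not part of Slurm. It only covers forward lookups; it does not replace an actual ping or a comparison against each node's own hostname.

```python
import socket
from typing import Callable, Dict, Iterable, Optional


def check_names(
    hosts: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, Optional[str]]:
    """Map each host name to its resolved IPv4 address, or None if lookup fails."""
    results: Dict[str, Optional[str]] = {}
    for host in hosts:
        try:
            results[host] = resolve(host)
        except OSError:
            # socket.gaierror (failed lookup) is a subclass of OSError.
            results[host] = None
    return results
```

Run it on the backup controller with the node names, and on a compute node with the controller names exactly as they appear in slurm.conf, for example `check_names(["ctl-backup", "node01"])` (hypothetical names). Any entry mapped to `None` points at a resolution problem on that host.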
Also, a quick way to do the failover check is to run (from the backup 
controller): scontrol takeover
Brian Andrus
On 3/25/2024 1:39 PM, Miriam Olmi wrote:
...
Hi Brian,
Thanks for replying.
In my first message I forgot to specify that the primary and the 
backup controller have a shared filesystem mounted.
StateSaveLocation points to a directory on the shared filesystem, so 
both the primary and the backup controller really are reading and 
writing the very same files.
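For anyone wanting to double-check this kind of setup, here is a minimal sketch that pulls the failover-related settings out of slurm.conf text so they can be compared between the two controllers. In slurm.conf the parameter is spelled `StateSaveLocation`. The helper name `read_slurm_settings`, the host names, and the path in `SAMPLE` are hypothetical, and the sketch assumes one setting per line with repeated `SlurmctldHost` entries (older configurations use `ControlMachine` and `BackupController` instead).

```python
from typing import Dict, List


def read_slurm_settings(conf_text: str, keys: List[str]) -> Dict[str, List[str]]:
    """Collect values of selected Key=Value settings (keys matched case-insensitively)."""
    wanted = {k.lower(): k for k in keys}
    found: Dict[str, List[str]] = {k: [] for k in keys}
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        name = wanted.get(key.strip().lower())
        if name is not None:
            found[name].append(value.strip())
    return found


# Hypothetical excerpt; names and path are placeholders, not from this thread.
SAMPLE = """\
# controllers: first entry is the primary, second the backup
SlurmctldHost=ctl-primary
SlurmctldHost=ctl-backup
StateSaveLocation=/shared/slurm/state
SlurmctldTimeout=10
"""

settings = read_slurm_settings(
    SAMPLE, ["SlurmctldHost", "StateSaveLocation", "SlurmctldTimeout"]
)
```

Reading the real file on each controller and comparing the two results is a quick way to confirm that both see the same controller list and the same state directory.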
Any other ideas?
Thanks again,
Miriam
On 25 March 2024 19:23:23 CET, Brian Andrus via slurm-users 
slurm-users@lists.schedmd.com wrote:
Quick correction, it is StateSaveLocation, not SlurmSaveState.
Brian Andrus
On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote:

    Dear all,

    I am having trouble finalizing the configuration of the backup
    controller for my Slurm cluster.

    In principle, if no job is running, everything seems fine: the
    slurmctld services on both the primary and the backup controller
    are running, and if I stop the service on the primary controller,
    the backup controller takes over after about 10 s
    (SlurmctldTimeout = 10 sec). Also, if I run the sinfo or squeue
    command during those 10 s of inactivity, the shell hangs, but it
    recovers perfectly once the backup controller has taken control.
    It works the same way when the primary controller comes back.

    Unfortunately, if I run the same test while a job is running,
    there are two different behaviors depending on the initial
    scenario.

    1st scenario: Both the primary and the backup controller are
    fine. I launch a batch script and verify with sinfo and squeue
    that it is running. While the script is still running, I stop the
    service on the primary controller, which succeeds, but at this
    point everything goes wrong. In the slurmctld service log on the
    backup controller I find the following errors:

        slurmctld: error: Invalid RPC received REQUEST_JOB_INFO while in standby mode
        slurmctld: error: Invalid RPC received REQUEST_PARTITION_INFO while in standby mode
        slurmctld: error: Invalid RPC received REQUEST_JOB_INFO while in standby mode
        slurmctld: error: Invalid RPC received REQUEST_PARTITION_INFO while in standby mode
        slurmctld: error: slurm_accept_msg_conn poll: Bad address
        slurmctld: error: slurm_accept_msg_conn poll: Bad address

    and the sinfo and squeue commands fail with "Unable to contact
    slurm controller (connect failure)".

    2nd scenario: The primary controller is stopped, and I launch a
    batch job while the backup controller is the only one working.
    While the job is running, I restart the slurmctld service on the
    primary controller. In this case the primary controller takes
    over immediately: everything is smooth and safe, and the sinfo
    and squeue commands continue to work perfectly.

    What might be the problem?

    Many thanks in advance!
    Miriam
