Dear all,
I am having trouble finalizing the configuration of the backup controller for my Slurm cluster.
In principle, if no job is running, everything seems fine: the slurmctld services on both the primary and the backup controller are running, and if I stop the service on the primary controller, the backup controller takes over after roughly 10 s (SlurmctldTimeout = 10 sec).
Also, if I run the sinfo or squeue command during those 10 s of inactivity, the shell hangs, but it recovers perfectly once the backup controller has taken control. It works the same way when the primary controller comes back.
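For anyone who wants to reproduce the timing, the takeover delay can be measured with a small polling script. This is only a minimal sketch: the function name wait_until_responsive and its parameters are made up for illustration, and it assumes that a client command such as squeue exits with a non-zero status (or hangs) while no controller is answering.

```python
import subprocess
import time


def wait_until_responsive(cmd, timeout_s=60.0, interval_s=1.0, attempt_timeout_s=5.0):
    """Re-run cmd until it exits with status 0.

    Returns the elapsed time in seconds, or None if timeout_s passes first.
    All names here are hypothetical helpers, not part of Slurm.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=attempt_timeout_s,  # do not let a hanging client block the loop
            )
            if result.returncode == 0:
                return time.monotonic() - start
        except (subprocess.TimeoutExpired, OSError):
            # The command hung or could not be started: count it as "not responsive yet".
            pass
        time.sleep(interval_s)
    return None


# Example use, started right after stopping slurmctld on the primary:
#   elapsed = wait_until_responsive(["squeue"])
#   print("no answer within the timeout" if elapsed is None else f"answered after {elapsed:.1f} s")
```

Run once with an idle cluster and once with a job running, this should show whether the backup ever starts answering in the failing case or stays in standby.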
Unfortunately, if I run the same test while a job is running, I see two different behaviors depending on the initial scenario.
1st scenario: both the primary and the backup controller are fine. I launch a batch script and verify with sinfo and squeue that it is running. While the script is still running, I stop the service on the primary controller. The stop itself succeeds, but at this point everything goes wrong:
on the backup controller, the slurmctld service log shows the following errors:
slurmctld: error: Invalid RPC received REQUEST_JOB_INFO while in standby mode
slurmctld: error: Invalid RPC received REQUEST_PARTITION_INFO while in standby mode
slurmctld: error: Invalid RPC received REQUEST_JOB_INFO while in standby mode
slurmctld: error: Invalid RPC received REQUEST_PARTITION_INFO while in standby mode
slurmctld: error: slurm_accept_msg_conn poll: Bad address
slurmctld: error: slurm_accept_msg_conn poll: Bad address
and the sinfo and squeue commands fail with "Unable to contact slurm controller (connect failure)".
2nd scenario: the primary controller is stopped, and I launch a batch job while the backup controller is the only one running. While the job is running, I restart the slurmctld service on the primary controller. In this case the primary controller takes over immediately: everything is smooth and safe, and the sinfo and squeue commands continue to work perfectly.
What might be the problem?
Many thanks in advance!
Miriam