Dear all,
I am having trouble finalizing the configuration of the backup
controller for my Slurm cluster.
In principle, if no job is running, everything seems fine: the
slurmctld service is running on both the primary and the backup
controller, and if I stop the service on the primary controller, the
backup controller takes over after about 10 s (SlurmctldTimeout = 10 sec).
Also, if I run the sinfo or squeue command during those 10 s of
inactivity, the command hangs, but it completes normally once the
backup controller has taken control. It works the same way when the
primary controller comes back.
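
To put a number on how long a client command hangs during the
takeover, something like the following could be used. It is only a
minimal sketch: the helper name time_command and the 60 s default
timeout are my own choices for illustration, not anything provided by
Slurm.

```python
import subprocess
import time


def time_command(cmd, timeout=60.0):
    """Run cmd and return (elapsed_seconds, return_code).

    return_code is None if the command did not finish within `timeout`
    seconds (the child process is killed in that case).
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        code = result.returncode
    except subprocess.TimeoutExpired:
        code = None
    return time.monotonic() - start, code
```

Calling time_command(["squeue"]) right after stopping the primary
slurmctld should then return an elapsed time close to the takeover
delay, with a return code of 0 if the command recovered.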
Unfortunately, if I run the same test while a job is running, I see
two different behaviors depending on the initial scenario.
1st scenario:
Both the primary and the backup controller are fine. I launch a batch
script and verify with sinfo and squeue that it is running. While the
script is still running, I stop the service on the primary controller.
The stop itself succeeds, but at this point everything goes wrong: in
the slurmctld service log on the backup controller I find the
following errors:
slurmctld: error: Invalid RPC received REQUEST_JOB_INFO while in standby mode
slurmctld: error: Invalid RPC received REQUEST_PARTITION_INFO while in standby mode
slurmctld: error: Invalid RPC received REQUEST_JOB_INFO while in standby mode
slurmctld: error: Invalid RPC received REQUEST_PARTITION_INFO while in standby mode
slurmctld: error: slurm_accept_msg_conn poll: Bad address
slurmctld: error: slurm_accept_msg_conn poll: Bad address
and the sinfo and squeue commands fail with "Unable to contact slurm
controller (connect failure)".
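
To see which requests the backup rejects while it is still in standby
mode, the log can be summarized per RPC type. The snippet below is
only a sketch: it assumes each log entry sits on a single line in the
real log file, and the function name count_standby_rejections is made
up for illustration.

```python
import re
from collections import Counter

# Matches the standby-mode rejections as they are worded in the log above.
STANDBY_RPC = re.compile(
    r"error: Invalid RPC received (\S+) while in standby mode"
)


def count_standby_rejections(lines):
    """Count, per RPC type, how often a request was rejected in standby mode."""
    counts = Counter()
    for line in lines:
        match = STANDBY_RPC.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing it the lines of the backup controller's slurmctld log gives a
count per RPC name. Lines that do not match the pattern, such as the
slurm_accept_msg_conn errors, are ignored.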
2nd scenario:
The primary controller is stopped, and I launch a batch job while the
backup controller is the only one working. While the job is running, I
restart the slurmctld service on the primary controller. In this case
the primary controller takes over immediately: everything goes
smoothly, and the sinfo and squeue commands continue to work perfectly.
What might be the problem?
Many thanks in advance!
Miriam