Dear all,
We recently tried to fix our version of Slurm on every node of our cluster. After the installation (Slurm 20.11.9) on one of the compute nodes, most of the commands (squeue, sinfo, scontrol show config, etc.) return this error:
error: Unable to contact slurm controller (connect failure)
The .log files don't show any errors, even though both debug values are set to debug5. The rest of the cluster works as usual.
I would appreciate any insight into what could be the cause.
Thank you and regards, Daniel
A few things to look at: make sure DNS/hostname resolution works, disable any firewalls for testing (you can lock things down again afterwards), and make sure the slurm.conf file is the same on all nodes.
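The three checks above can be sketched in a few lines of Python run on the affected node. This is only an illustration, not an official Slurm tool: the port 6817 is Slurm's default SlurmctldPort and may be overridden in your slurm.conf, the path /etc/slurm/slurm.conf is an assumed location, and the controller hostname is passed on the command line.

```python
import hashlib
import socket

# Default SlurmctldPort; change it if your slurm.conf sets another value.
SLURMCTLD_PORT = 6817


def resolves(hostname):
    """Return True if the local resolver can map hostname to an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def port_open(hostname, port=SLURMCTLD_PORT, timeout=3.0):
    """Return True if a TCP connection to hostname:port succeeds."""
    try:
        with socket.create_connection((hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and resolution failures.
        return False


def config_digest(data):
    """SHA-256 hex digest of slurm.conf contents (bytes)."""
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): check_node.py <controller> [conf path]
    controller = sys.argv[1]
    conf_path = sys.argv[2] if len(sys.argv) > 2 else "/etc/slurm/slurm.conf"

    print("name resolves:  ", resolves(controller))
    print("slurmctld port: ", port_open(controller))
    with open(conf_path, "rb") as fh:
        print("slurm.conf sha256:", config_digest(fh.read()))
```

Run it on the broken node and on a working one. If the digests differ, the slurm.conf files are not identical; if the name does not resolve or the port is closed only on the broken node, look at DNS or the firewall there.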
I've just done a 20.11.9 to 24.05.2 upgrade along with a CentOS 7.9 to RHEL 9.10 upgrade on all my nodes.
Sid
On Tue, 19 Nov 2024, 03:23 Daniel Rodriguez Lopez (ext) via slurm-users, < slurm-users@lists.schedmd.com> wrote:
Hi Daniel,
error: Unable to contact slurm controller (connect failure)
I appreciate any insight on what could be the cause.
Can you check that slurmctld is up and running, and that those commands work on the controller machine itself? If slurmctld cannot be started as a service, try running it in verbose debug mode (-D -vvv) to find out what might be wrong with it. If it runs in the foreground, check the systemd service again. Proceed to the compute nodes only when you are sure that the ctld is OK. (IIRC there was a flag in the systemd service definition that had to be adjusted after an upgrade; maybe you're hitting the same?)
Best, Steffen
Hi,
Thank you all for the quick answers. We tried your suggestions, and the problem was in slurm.conf: we had not noticed that the name of the control server had a typo.
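For anyone finding this thread later, a check along these lines might catch the same mistake. It is a minimal sketch, not something we actually ran: it reads the controller name from slurm.conf (SlurmctldHost, or the older ControlMachine spelling) and reports any name that the node cannot resolve. The hostnames in the usage note are made up.

```python
import re
import socket

# Matches "SlurmctldHost=name" or "SlurmctldHost=name(address)";
# ControlMachine is the older key. Commented-out lines do not match.
_KEY = re.compile(
    r"^\s*(SlurmctldHost|ControlMachine)\s*=\s*([^\s#(]+)(?:\(([^)]*)\))?",
    re.IGNORECASE,
)


def controller_hosts(conf_text):
    """Return [(name, address_or_None), ...] for each controller line."""
    hosts = []
    for line in conf_text.splitlines():
        match = _KEY.match(line)
        if match:
            hosts.append((match.group(2), match.group(3) or None))
    return hosts


def unresolvable(hosts, resolver=socket.getaddrinfo):
    """Return the controller names/addresses that fail to resolve."""
    bad = []
    for name, address in hosts:
        target = address or name  # an explicit address takes precedence
        try:
            resolver(target, None)
        except OSError:  # socket.gaierror is a subclass of OSError
            bad.append(target)
    return bad


if __name__ == "__main__":
    # Assumed default location; pass your own path if it differs.
    with open("/etc/slurm/slurm.conf") as fh:
        hosts = controller_hosts(fh.read())
    print("controllers in slurm.conf:", hosts)
    print("not resolvable from here: ", unresolvable(hosts))
```

For example, a line such as SlurmctldHost=ctl0l (with a lowercase L in place of a digit) would show up in the "not resolvable" list if no host of that name exists. A typo that happens to match another real host would not be caught this way, so comparing the file against a working node is still worthwhile.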
Thank you, I really appreciate the help.
Best, Daniel