[slurm-users] slurm_persist_conn_open_without_init: failed to open persistent connection to host

William Brown william at signalbox.org.uk
Wed Nov 30 23:34:21 UTC 2022


If this is a single-host machine, I suggest checking the /etc/hosts file to make sure that ‘mannose’ is listed as you expect. It is generally advised to use FQDNs for host names; the fact that the message “connection to host:mannose:6819: Connection refused” uses a short name may mean that one of your configuration files contains a short name. Equally, the incoming connection may be coming not from the IP of ‘mannose’ but from localhost (127.0.0.1 if you are using only IPv4).
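
The /etc/hosts check can also be scripted. This Python sketch is an illustration, not a Slurm tool. It prints the IPv4 addresses the system resolver (which normally consults /etc/hosts) returns for a name, and whether any of them is a loopback address. If one is, connections made to that name will come from 127.0.0.1 rather than from the machine's network IP. The host name ‘mannose’ is taken from the log; `describe` is a hypothetical helper.

```python
from __future__ import annotations

import ipaddress
import socket


def resolve_ipv4(hostname: str) -> list[str]:
    """Return the distinct IPv4 addresses the system resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def describe(hostname: str) -> str:
    """One-line summary: addresses, loopback or not, and the FQDN this machine reports."""
    addrs = resolve_ipv4(hostname)
    loopback = any(ipaddress.ip_address(a).is_loopback for a in addrs)
    kind = "loopback" if loopback else "non-loopback"
    return f"{hostname} -> {', '.join(addrs)} ({kind}); getfqdn: {socket.getfqdn(hostname)}"
```

For example, `print(describe("mannose"))` on the Slurm host shows at a glance whether the short name resolves to 127.0.0.1 or to the LAN address.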

You also have a cluster name that looks like an FQDN; you may want to change that to something else. The cluster name is, I think, an abstract name, whereas host names must belong to real nodes that are resolvable.
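
The configured cluster name can be checked with a small Python sketch. The reader below is a simplified key=value parser written for this example, not Slurm's own parser. Lower-casing the keys assumes slurm.conf parameter names are case-insensitive, and flagging a dot only encodes the suggestion above rather than a documented Slurm rule.

```python
from __future__ import annotations


def parse_slurm_conf(text: str) -> dict[str, str]:
    """Tiny key=value reader for slurm.conf-style text (a simplified sketch).

    Comments and blank lines are skipped, keys are lower-cased, and only the
    first key=value pair on each line is kept, so NodeName/PartitionName lines
    with several pairs are deliberately not handled properly.
    """
    conf: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        tokens = value.split()
        conf[key.strip().lower()] = tokens[0] if tokens else ""
    return conf


def cluster_name_warning(conf: dict[str, str]) -> str | None:
    """Flag a ClusterName that looks like a host name (contains a dot)."""
    name = conf.get("clustername")
    if not name:
        return "ClusterName is not set"
    if "." in name:
        return f"ClusterName {name!r} looks like an FQDN; consider a short, abstract label"
    return None
```

Usage: read your slurm.conf (its location depends on the install; /etc/slurm/slurm.conf is only a common guess) and pass the text to `parse_slurm_conf`, then to `cluster_name_warning`; `None` means no warning.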

You may also find information in /var/log/messages or /var/log/secure, if applicable to your Linux distro.

I use Slurm with firewalld and it is fine usually.
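
One way to tell "nothing is listening" apart from "a firewall is in the way" is a plain TCP connection attempt. This Python sketch is not part of Slurm, and `probe_tcp` is a hypothetical helper. As a rule of thumb, a refused connection usually means no daemon is bound to that port, while a timeout suggests packets are being silently dropped. firewalld's default reject typically surfaces as "No route to host" (EHOSTUNREACH) instead.

```python
import errno
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connect attempt as 'open', 'refused', 'timeout' or another error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer answered with a reset: usually nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all: often a firewall dropping packets.
        return "timeout"
    except OSError as exc:
        # e.g. EHOSTUNREACH when a firewall rejects with an ICMP error.
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"
```

For example, `for port in (6817, 6818, 6819): print(port, probe_tcp("mannose", port))`. Keep in mind that rules added with `firewall-cmd --permanent` only reach the running firewall after `firewall-cmd --reload`.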

William

From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Sushil Mishra
Sent: 30 November 2022 22:44
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: [slurm-users] slurm_persist_conn_open_without_init: failed to open persistent connection to host

Hi all,

I installed Slurm and enabled accounting on a single-node machine, i.e. the same server is both the master and the compute node. I mainly followed this page for instructions:

https://southgreenplatform.github.io/trainings/hpc/slurminstallation/

After enabling accounting, I am having problems starting slurmctld.service.

[root@mannose sushil]# cat /var/log/slurm/slurmctld.log
[2022-11-30T16:32:15.194] Job accounting information stored, but details not gathered
[2022-11-30T16:32:15.195] slurmctld version 20.11.9 started on cluster mannose.olemiss.edu
[2022-11-30T16:32:15.201] error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:mannose:6819: Connection refused
[2022-11-30T16:32:15.201] error: Sending PersistInit msg: Connection refused
[2022-11-30T16:32:15.201] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
[2022-11-30T16:32:15.203] error: Sending PersistInit msg: Connection refused
[2022-11-30T16:32:15.203] error: Association database appears down, reading from state file.
[2022-11-30T16:32:15.203] error: Unable to get any information from the state file
[2022-11-30T16:32:15.203] fatal: slurmdbd and/or database must be up at slurmctld start time
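
The failing target can be pulled out of the log mechanically. This Python sketch uses a regular expression written only against the lines quoted above, so the message wording in other Slurm versions is an assumption. Applied to this log, it shows that slurmctld is trying to reach mannose on port 6819, the slurmdbd connection, rather than one of its own ports.

```python
from __future__ import annotations

import re

# Written against the slurmctld 20.11 lines quoted above; other versions may differ.
PERSIST_RE = re.compile(
    r"failed to open persistent connection to host:"
    r"(?P<host>[^:\s]+):(?P<port>\d+): (?P<reason>.+)$"
)


def failed_targets(log_text: str) -> list[tuple[str, int, str]]:
    """Return (host, port, reason) for every failed persistent-connection line."""
    hits = []
    for line in log_text.splitlines():
        match = PERSIST_RE.search(line)
        if match:
            hits.append((match["host"], int(match["port"]), match["reason"].strip()))
    return hits
```

Passing the contents of /var/log/slurm/slurmctld.log to `failed_targets` lists each host and port that slurmctld could not reach.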

It is not clear why port 6819 is being used when I have SlurmctldPort=6817 and SlurmdPort=6818 set in slurm.conf. Anyway, I opened all three ports (6817, 6818 and 6819) using commands such as 'firewall-cmd --permanent --zone=public --add-port=6819/tcp'.
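
The 6819 is most likely the slurmdbd port, not a stray value. By Slurm's documented defaults, slurmctld listens on 6817 (SlurmctldPort), slurmd on 6818 (SlurmdPort), and slurmdbd on 6819 (AccountingStoragePort in slurm.conf, which should match DbdPort in slurmdbd.conf). With accounting enabled, the controller therefore connects to 6819 even if slurm.conf never mentions that port. A refusal there usually points at slurmdbd not listening. The Python sketch below is a simplified reader, not Slurm's parser, and reports which port each daemon is expected on.

```python
from __future__ import annotations

# Documented defaults used when slurm.conf does not set these keys.
DEFAULT_PORTS = {
    "slurmctldport": "6817",          # slurmctld
    "slurmdport": "6818",             # slurmd on each compute node
    "accountingstorageport": "6819",  # slurmdbd; should match DbdPort in slurmdbd.conf
}


def effective_ports(conf_text: str) -> dict[str, str]:
    """Return the port each daemon is expected on, falling back to the defaults."""
    ports = dict(DEFAULT_PORTS)
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        key, sep, value = line.partition("=")
        key = key.strip().lower()
        if sep and key in ports and value.strip():
            # Keep the raw string: SlurmctldPort may be a range such as 6817-6818.
            ports[key] = value.split()[0]
    return ports
```

Feeding it a slurm.conf that sets only SlurmctldPort and SlurmdPort still reports 6819 for the accounting connection.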

MariaDB [(none)]> show grants
    -> ;
+--------------------------------------------------------------------------------------------------------------+
| Grants for slurm@localhost                                                                                   |
+--------------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO 'slurm'@'localhost' IDENTIFIED BY PASSWORD '*0E54A04D59B6C9F7B7B6269BE7F30AD3E3409895' |
| GRANT ALL PRIVILEGES ON `slurm_acct_db`.* TO 'slurm'@'localhost' WITH GRANT OPTION                           |
+--------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

MariaDB [(none)]> quit
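
MariaDB matches accounts on both user name and client host, so the grants above only cover slurmdbd if it connects as 'slurm' from 'localhost'. On a single host this is generally the case when slurmdbd.conf uses StorageHost=localhost. Connecting through a name such as 'mannose' can make the server see a different client host. The Python sketch below uses hypothetical helpers. It pulls the user@host pairs out of SHOW GRANTS output and checks one of them. It only does exact matches and does not interpret wildcard hosts such as '%'.

```python
from __future__ import annotations

import re

# Captures the account in lines like: ... TO 'slurm'@'localhost' ...
GRANT_RE = re.compile(r"\bTO\s+'(?P<user>[^']*)'@'(?P<host>[^']*)'", re.IGNORECASE)


def granted_accounts(show_grants_output: str) -> set[tuple[str, str]]:
    """Collect the distinct (user, host) pairs named in SHOW GRANTS output."""
    return {(m["user"], m["host"]) for m in GRANT_RE.finditer(show_grants_output)}


def check_account(show_grants_output: str, user: str, client_host: str) -> str:
    """Exact-match check only; wildcard hosts like '%' are not interpreted."""
    accounts = granted_accounts(show_grants_output)
    if (user, client_host) in accounts:
        return "ok"
    return f"no grant for '{user}'@'{client_host}' (found {sorted(accounts)})"
```

Compare the result against the StorageUser set in slurmdbd.conf and the host MariaDB will see the connection coming from.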

Can someone help me figure out what might be going wrong?

Best,

SK
