[slurm-users] slurm_persist_conn_open_without_init: failed to open persistent connection to host

Sushil Mishra sushilbioinfo at gmail.com
Thu Dec 1 21:33:33 UTC 2022


Many thanks, William. That may have been the issue. I changed the hostname
to the FQDN and set "StorageHost=localhost", and now it seems to try
connecting to the database.
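
A plain TCP connect test can help tell "nothing is listening on the port"
apart from "something is listening but rejects the request". The sketch below
is only illustrative: the helper name port_is_open is hypothetical, and port
6819 is simply the one that appears in the slurmctld log in this thread.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


# Example call (6819 is the port shown in the log, not a verified setting):
# print(port_is_open("localhost", 6819))
```

A True result only shows that some process accepts connections on that port;
it says nothing about whether authentication will succeed.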

[root@mannose sushil]# cat /var/log/slurm/slurmctld.log
[2022-12-01T15:26:50.942] Job accounting information stored, but details not gathered
[2022-12-01T15:26:50.943] slurmctld version 20.11.9 started on cluster mannose
[2022-12-01T15:26:52.949] error: If munged is up, restart with --num-threads=10
[2022-12-01T15:26:52.949] error: Munge encode failed: Failed to access "Abcd_123": No such file or directory
[2022-12-01T15:26:52.950] error: slurm_send_node_msg: g_slurm_auth_create: REQUEST_PERSIST_INIT has authentication error: Invalid authentication credential
[2022-12-01T15:26:52.950] error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
[2022-12-01T15:26:52.950] error: Sending PersistInit msg: Protocol authentication error
[2022-12-01T15:26:52.950] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
[2022-12-01T15:26:54.954] error: If munged is up, restart with --num-threads=10
[2022-12-01T15:26:54.954] error: Munge encode failed: Failed to access "Abcd_123": No such file or directory
[2022-12-01T15:26:54.954] error: slurm_send_node_msg: g_slurm_auth_create: REQUEST_PERSIST_INIT has authentication error: Invalid authentication credential
[2022-12-01T15:26:54.954] error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
[2022-12-01T15:26:54.954] error: Sending PersistInit msg: Protocol authentication error
[2022-12-01T15:26:54.955] error: Association database appears down, reading from state file.
[2022-12-01T15:26:54.955] error: Unable to get any information from the state file
[2022-12-01T15:26:54.955] fatal: slurmdbd and/or database must be up at slurmctld start time

"Abcd_123" is the password. This password works to access the database:


[root@mannose sushil]# mysql -p -u slurm
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 581
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show grants;
+--------------------------------------------------------------------------------------------------------------+
| Grants for slurm@localhost                                                                                   |
+--------------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO 'slurm'@'localhost' IDENTIFIED BY PASSWORD '*0E54A04D59B6C9F7B7B6269BE7F30AD3E3409895' |
| GRANT ALL PRIVILEGES ON `slurm_acct_db`.* TO 'slurm'@'localhost' WITH GRANT OPTION                           |
+--------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

MariaDB [(none)]>

Any pointers to fix this?
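
One possible reading of the log above, offered as an assumption rather than a
confirmed diagnosis: the line 'Munge encode failed: Failed to access
"Abcd_123"' suggests the password string is being handed to MUNGE as a socket
path. The slurm.conf man page describes AccountingStoragePass, when slurmdbd
is used with MUNGE, as naming an alternate MUNGE socket, so a database
password placed there (instead of in StoragePass in slurmdbd.conf) could
produce this message. The sketch below looks for that setting in slurm.conf
text. The function names are hypothetical and the parsing is deliberately
simplified (it treats everything after '#' as a comment).

```python
import re


def find_accounting_storage_pass(conf_text):
    """Return the AccountingStoragePass value from slurm.conf text, or None."""
    for raw in conf_text.splitlines():
        # Simplification: drop everything after '#' as a comment.
        line = raw.split("#", 1)[0].strip()
        m = re.match(r"AccountingStoragePass\s*=\s*(\S+)", line, re.IGNORECASE)
        if m:
            return m.group(1)
    return None


def looks_like_socket_path(value):
    """Heuristic only: an alternate MUNGE socket would be an absolute path."""
    return value.startswith("/")
```

It could be used along these lines, assuming the file lives at
/etc/slurm/slurm.conf (the location may differ on your system): read the file
with open(path).read(), pass the text to find_accounting_storage_pass, and if
the returned value is not None and does not look like a socket path, consider
removing that line from slurm.conf and keeping the database password only in
slurmdbd.conf.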

best,
Sushil


On Wed, Nov 30, 2022 at 5:36 PM William Brown <william at signalbox.org.uk>
wrote:

> If this is a single-host machine, I suggest checking the /etc/hosts file to
> make sure that ‘mannose’ is listed as you expect. It is generally advised
> to use FQDNs for host names; the fact that the message “connection to
> host:mannose:6819: Connection refused” used a short name may mean that you
> have a short name in a configuration file. Equally, the incoming
> connection may be coming not from the IP of ‘mannose’ but from localhost
> (127.0.0.1 if you are using only IPv4).
>
>
>
> You also have a cluster name that looks like an FQDN; you may want to
> change that to something else. The cluster name is, I think, an abstract
> name, whereas host names must belong to real nodes that are resolvable.
>
>
>
> You may also find information in /var/log/messages or /var/log/secure, if
> applicable to your Linux distro.
>
>
>
> I use Slurm with firewalld and it is fine usually.
>
>
>
> William
>
>
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf Of
> *Sushil Mishra
> *Sent:* 30 November 2022 22:44
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* [slurm-users] slurm_persist_conn_open_without_init: failed to
> open persistent connection to host
>
>
>
> Hi all,
>
>
>
> I installed Slurm and enabled accounting on a single-node machine, i.e. the
> same server is both the master and the compute node. I mainly followed this
> page for instructions:
>
> https://southgreenplatform.github.io/trainings/hpc/slurminstallation/
>
> After enabling accounting, I am having problems starting
> slurmctld.service.
>
> [root@mannose sushil]# cat /var/log/slurm/slurmctld.log
> [2022-11-30T16:32:15.194] Job accounting information stored, but details not gathered
> [2022-11-30T16:32:15.195] slurmctld version 20.11.9 started on cluster mannose.olemiss.edu
> [2022-11-30T16:32:15.201] error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:mannose:6819: Connection refused
> [2022-11-30T16:32:15.201] error: Sending PersistInit msg: Connection refused
> [2022-11-30T16:32:15.201] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
> [2022-11-30T16:32:15.203] error: Sending PersistInit msg: Connection refused
> [2022-11-30T16:32:15.203] error: Association database appears down, reading from state file.
> [2022-11-30T16:32:15.203] error: Unable to get any information from the state file
> [2022-11-30T16:32:15.203] fatal: slurmdbd and/or database must be up at slurmctld start time
>
>
>
> It is not clear why Slurm port 6819 is being used when I have
> SlurmctldPort=6817 and SlurmdPort=6818 set in slurm.conf. Anyway, I opened
> all three ports (6817, 6818 and 6819) using 'firewall-cmd --permanent
> --zone=public --add-port=6819/tcp'.
>
>
>
> MariaDB [(none)]> show grants
>
>     -> ;
>
> +--------------------------------------------------------------------------------------------------------------+
> | Grants for slurm@localhost                                                                                   |
> +--------------------------------------------------------------------------------------------------------------+
> | GRANT USAGE ON *.* TO 'slurm'@'localhost' IDENTIFIED BY PASSWORD '*0E54A04D59B6C9F7B7B6269BE7F30AD3E3409895' |
> | GRANT ALL PRIVILEGES ON `slurm_acct_db`.* TO 'slurm'@'localhost' WITH GRANT OPTION                           |
> +--------------------------------------------------------------------------------------------------------------+
> 2 rows in set (0.00 sec)
>
> MariaDB [(none)]> quit
>
>
>
> Can someone help figure out what is going wrong?
>
>
>
> Best,
>
> SK
>
>
>