[slurm-users] Slurm - sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused (Zainul Abiddin)

Michael Smith msmith at tenstorrent.com
Tue Feb 2 19:51:23 UTC 2021


A few things to check here:


  *   Ensure that your firewall ports are open: 6817, 6818, 6819 and 3306.
  *   Make sure that munge is working correctly:
      $ munge -n | unmunge
  *   Go through the accounting web page as well: https://slurm.schedmd.com/accounting.html
     *   In particular, ensure that you can connect to the MySQL server, create the slurm user within the MySQL database, give it the required permissions, etc. Go through the “Live example” on the accounting web page.
  *   Walk through your log files, especially slurmdbd.log, and clear up all errors.
  *   As a general comment, put as few configuration options as possible into your slurm.conf and slurmdbd.conf files; use the defaults when you can. Add items incrementally and carefully so you can back out easily when you make mistakes (and you will!).
  *   In my slurm.conf, I have also specified AccountingStorageHost, AccountingStorageUser and AccountingStoragePort, though I am not sure whether I need any of these.
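
The port check in the list above can be scripted. What follows is a minimal sketch, not something taken from the thread: the helper names port_state and report_ports are made up for illustration, and it assumes that bash and the coreutils timeout command are installed. The ports are the usual Slurm and MySQL defaults (6817 slurmctld, 6818 slurmd, 6819 slurmdbd, 3306 MySQL).

```shell
# port_state HOST PORT
# Prints "open" if a TCP connection to HOST:PORT succeeds within 2 seconds,
# otherwise "closed" (connection refused, filtered, or timed out).
# Hypothetical helper; relies on bash's /dev/tcp pseudo-device.
port_state() {
    if timeout 2 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null; then
        echo open
    else
        echo closed
    fi
}

# report_ports HOST PORT...
# Prints one "HOST:PORT state" line per port.
report_ports() {
    rp_host=$1
    shift
    for rp_port in "$@"; do
        printf '%s:%s %s\n' "$rp_host" "$rp_port" "$(port_state "$rp_host" "$rp_port")"
    done
}
```

A possible invocation on the controller host:

    report_ports localhost 6817 6818 6819 3306

When the check is run against localhost, a "closed" result usually means that nothing is listening on that port, rather than that a firewall is blocking it.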

Mike

From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of slurm-users-request at lists.schedmd.com <slurm-users-request at lists.schedmd.com>
Date: Tuesday, February 2, 2021 at 8:16 AM
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: slurm-users Digest, Vol 40, Issue 4
Today's Topics:

   1. Slurm - sacct: error: slurm_persist_conn_open_without_init:
      failed to open persistent connection to host:localhost:6819:
      Connection refused (Zainul Abiddin)
   2. Re: Slurm - Munge configuration details (Benson Muite)


----------------------------------------------------------------------

Message: 1
Date: Tue, 2 Feb 2021 18:35:20 +0530
From: Zainul Abiddin <zainul1114 at gmail.com>
To: slurm-users at lists.schedmd.com
Subject: [slurm-users] Slurm - sacct: error:
        slurm_persist_conn_open_without_init: failed to open persistent
        connection to host:localhost:6819: Connection refused
Message-ID:
        <CAA9R82u0L7VdZDhvP_1KfWmVrLL-Cc5VhAVr2SgTuwN_1AXuUA at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi All,
I have configured slurmdbd, but when I run *sacct* I get the error below.

[root at smaster ~]# sacct
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
[root at smaster ~]#

My slurmdbd configuration:
[root at smaster ~]# cat /etc/slurm/slurmdbd.conf
AuthType=auth/munge
DbdAddr=localhost
DbdHost=localhost
SlurmUser=slurm
DebugLevel=4
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePass=password
StorageUser=slurm
StorageLoc=slurm_acct_db

[root at smaster ~]# chown slurm: /etc/slurm/slurmdbd.conf
[root at smaster ~]# chmod 600 /etc/slurm/slurmdbd.conf
[root at smaster ~]# mkdir /var/log/slurm
[root at smaster ~]# touch /var/log/slurm/slurmdbd.log
[root at smaster ~]# chown slurm: /var/log/slurm/slurmdbd.log
[root at smaster ~]# scontrol show config | grep AccountingStorageHost
AccountingStorageHost   = localhost

Note:
I edited /etc/slurm/slurm.conf and changed the line below:
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
Then I restarted all the services.

[root at smaster ~]# for i in munge slurmd slurmctld slurmdbd; do service $i status; done
Redirecting to /bin/systemctl status munge.service
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 13:21:10 IST; 3h 36min ago
     Docs: man:munged(8)
 Main PID: 20613 (munged)
   CGroup: /system.slice/munge.service
           └─20613 /usr/sbin/munged

Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Stopped MUNGE authentication service.
Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Starting MUNGE authentication service...
Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Started MUNGE authentication service.
Redirecting to /bin/systemctl status slurmd.service
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 13:21:10 IST; 3h 36min ago
 Main PID: 20637 (slurmd)
   CGroup: /system.slice/slurmd.service
           └─20637 /usr/sbin/slurmd -D

Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Started Slurm node daemon.
Feb 02 15:30:47 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 7 for UID 0
Feb 02 15:31:46 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 8 for UID 0
Feb 02 15:33:43 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 9 for UID 0

Redirecting to /bin/systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 13:21:11 IST; 3h 36min ago
 Main PID: 20660 (slurmctld)
   CGroup: /system.slice/slurmctld.service
           └─20660 /usr/sbin/slurmctld -D

Feb 02 13:21:11 smaster.calligotech.com systemd[1]: Started Slurm controller daemon.
Redirecting to /bin/systemctl status slurmdbd.service
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 16:29:11 IST; 28min ago
 Main PID: 24146 (slurmdbd)
   CGroup: /system.slice/slurmdbd.service
           └─24146 /usr/sbin/slurmdbd -D

Feb 02 16:29:11 smaster.calligotech.com systemd[1]: Started Slurm DBD accounting daemon.
[root at smaster ~]# srun --ntasks=2 --label /bin/hostname
srun: job 22 queued and waiting for resources
srun: job 22 has been allocated resources
1: smaster.calligotech.com
0: smaster.calligotech.com
[root at smaster ~]#


However, when I run the command below:

[root at smaster ~]# sacct
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
[root at smaster ~]#

I then tried the following troubleshooting steps:

[root at smaster ~]# telnet localhost 6819
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
[root at smaster ~]#
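
Connection refused on both ::1 and 127.0.0.1 suggests that no process is accepting connections on port 6819, even though systemd reports slurmdbd as active. One way to confirm this is to look at the listening sockets. The sketch below is an illustration only: listeners_on is a hypothetical helper that filters the output of "ss -ltn" (from iproute2, assumed to be installed) for a given port.

```shell
# listeners_on PORT
# Reads `ss -ltn` style output on stdin and prints the local address of every
# listening socket bound to PORT. Returns non-zero when there is none.
# Hypothetical helper; assumes the local address is the fourth column.
listeners_on() {
    awk -v port="$1" '
        $1 == "LISTEN" && $4 ~ (":" port "$") { print $4; found = 1 }
        END { if (!found) exit 1 }
    '
}
```

A possible invocation:

    ss -ltn | listeners_on 6819 || echo "nothing is listening on 6819"

If nothing is listening, the next place to look is the slurmdbd log, since the daemon may still be waiting on its database connection.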

[root at smaster ~]# mysql -p -u slurm slurm_acct_db
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 9
Server version: 10.1.48-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [slurm_acct_db]> show tables;
Empty set (0.00 sec)

MariaDB [slurm_acct_db]>

Then I added DbdPort and restarted the services:
[root at smaster ~]# cat /etc/slurm/slurmdbd.conf
AuthType=auth/munge
DbdAddr=localhost
DbdHost=localhost
*DbdPort=6819*
SlurmUser=slurm
DebugLevel=4
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePass=password
StorageUser=slurm
StorageLoc=slurm_acct_db
[root at smaster ~]#

[root at smaster ~]# for i in munge slurmd slurmctld slurmdbd; do service $i status; done
Redirecting to /bin/systemctl status munge.service
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 13:21:10 IST; 3h 55min ago
     Docs: man:munged(8)
 Main PID: 20613 (munged)
   CGroup: /system.slice/munge.service
           └─20613 /usr/sbin/munged

Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Stopped MUNGE authentication service.
Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Starting MUNGE authentication service...
Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Started MUNGE authentication service.
Redirecting to /bin/systemctl status slurmd.service
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 13:21:10 IST; 3h 55min ago
 Main PID: 20637 (slurmd)
   CGroup: /system.slice/slurmd.service
           └─20637 /usr/sbin/slurmd -D

Feb 02 15:30:47 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 7 for UID 0
Feb 02 15:31:46 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 8 for UID 0
Feb 02 15:33:43 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 9 for UID 0
Feb 02 15:38:45 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 12 for UID 0

Redirecting to /bin/systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 13:21:11 IST; 3h 55min ago
 Main PID: 20660 (slurmctld)
   CGroup: /system.slice/slurmctld.service
           └─20660 /usr/sbin/slurmctld -D

Feb 02 13:21:11 smaster.calligotech.com systemd[1]: Started Slurm controller daemon.
Redirecting to /bin/systemctl status slurmdbd.service
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 16:29:11 IST; 47min ago
 Main PID: 24146 (slurmdbd)
   CGroup: /system.slice/slurmdbd.service
           └─24146 /usr/sbin/slurmdbd -D

Feb 02 16:29:11 smaster.calligotech.com systemd[1]: Started Slurm DBD accounting daemon.
[root at smaster ~]# ps -ef |grep slurm
root     20637     1  0 13:21 ?        00:00:00 /usr/sbin/slurmd -D
slurm    20660     1  0 13:21 ?        00:00:08 /usr/sbin/slurmctld -D
root     24146     1  0 16:29 ?        00:00:00 /usr/sbin/slurmdbd -D
root     25395 18378  0 17:17 pts/2    00:00:00 grep --color=auto slurm
[root at smaster ~]# sacct
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
[root at smaster ~]#

[root at smaster ~]# tail /var/log/slurm/slurmdbd.log
[2021-02-02T17:16:01.913] error: mysql_real_connect failed: 2005 Unknown MySQL server host 'smater' (-2)
[2021-02-02T17:16:01.913] error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
[2021-02-02T17:16:06.963] error: mysql_real_connect failed: 2005 Unknown MySQL server host 'smater' (-2)
[2021-02-02T17:16:06.963] error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
[2021-02-02T17:16:12.083] error: mysql_real_connect failed: 2005 Unknown MySQL server host 'smater' (-2)
[2021-02-02T17:16:12.083] error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
[2021-02-02T17:16:17.140] error: mysql_real_connect failed: 2005 Unknown MySQL server host 'smater' (-2)
[2021-02-02T17:16:17.141] error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
[2021-02-02T17:16:22.804] error: mysql_real_connect failed: 2005 Unknown MySQL server host 'smater' (-2)
[2021-02-02T17:16:22.804] error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
[root at smaster ~]#
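
The log above names the MySQL host 'smater', while the slurmdbd.conf shown earlier sets StorageHost=localhost. The status output also shows slurmdbd running with the same PID since 16:29, before these 17:16 log entries. One possible reading, which is an assumption and not confirmed anywhere in this thread, is that the running daemon was started from an earlier version of the file and has not been restarted since the edit. The sketch below compares the two values; conf_get, log_mysql_hosts and check_storage_host are hypothetical helper names, and the parsing is deliberately naive (plain Key=Value lines only).

```shell
# conf_get FILE KEY
# Prints the value of KEY from a Key=Value style config file.
# Key names are matched case-insensitively; comment lines are skipped.
conf_get() {
    awk -F= -v key="$2" '
        /^[ \t]*#/ { next }
        tolower($1) == tolower(key) { sub(/^[^=]*=/, ""); print; exit }
    ' "$1"
}

# log_mysql_hosts FILE
# Prints each distinct host named in "MySQL server host '...'" log errors.
log_mysql_hosts() {
    sed -n "s/.*MySQL server host '\([^']*\)'.*/\1/p" "$1" | sort -u
}

# check_storage_host CONF LOG
# Reports every host in the log that differs from StorageHost in the config.
check_storage_host() {
    csh_conf_host=$(conf_get "$1" StorageHost)
    for csh_host in $(log_mysql_hosts "$2"); do
        if [ "$csh_host" != "$csh_conf_host" ]; then
            echo "log mentions '$csh_host' but $1 sets StorageHost=$csh_conf_host"
        fi
    done
}
```

A possible invocation with the paths used in this message:

    check_storage_host /etc/slurm/slurmdbd.conf /var/log/slurm/slurmdbd.log

If a mismatch is reported, restarting the daemon with "systemctl restart slurmdbd", or running it in the foreground with "slurmdbd -D -vvv" to watch it start, should show which StorageHost it actually uses.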

The problem remains the same. Please help me resolve this issue.

Regards,
Zain

------------------------------

Message: 2
Date: Tue, 2 Feb 2021 16:16:09 +0300
From: Benson Muite <benson_muite at emailplus.org>
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] Slurm - Munge configuration details
Message-ID: <bd36d545-4fd7-05ec-4a51-bb2743258b34 at emailplus.org>
Content-Type: text/plain; charset=utf-8; format=flowed

On 2/2/21 4:00 PM, Zainul Abiddin wrote:
> Hi Benson,
>
> I am not able to do passwordless ssh between the master and compute nodes
> using the Munge service.
> When I run the command below, it asks for a password for
> the compute node.
>
> Have I configured this properly? I need clarity on this.
>
> [root at smaster ~]# munge -n | ssh snode unmunge
> root at snode's password:
> STATUS:           Success (0)
> ENCODE_HOST:      smaster.calligotech.com (192.168.1.195)
> ENCODE_TIME:      2021-02-01 13:58:16 +0530 (1612168096)
> DECODE_TIME:      2021-02-01 13:58:21 +0530 (1612168101)
> TTL:              300
> CIPHER:           aes128 (4)
> MAC:              sha1 (3)
> ZIP:              none (0)
> UID:              root (0)
> GID:              root (0)
> LENGTH:           0
>
> [root at smaster ~]#
>
> Regards,
> Zain
>
Hi Zain,

Perhaps try using the IP address instead of the hostname?

Also, are clocks synchronized? See
https://slurm.schedmd.com/quickstart_admin.html
Benson
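
On the clock question, the unmunge output quoted above already carries both timestamps as epoch seconds, so the gap between encoding on one host and decoding on the other can be read off directly. Below is a minimal sketch; munge_skew is a hypothetical helper name, and the result is only a rough indicator because it also includes the time the credential spent in transit (here, the time taken to type the ssh password).

```shell
# munge_skew
# Reads unmunge output on stdin and prints DECODE_TIME minus ENCODE_TIME in
# seconds, using the epoch values in parentheses at the end of each line.
# Returns non-zero if either field is missing.
munge_skew() {
    awk '
        /ENCODE_TIME:/ { enc = $NF }
        /DECODE_TIME:/ { dec = $NF }
        END {
            gsub(/[()]/, "", enc)
            gsub(/[()]/, "", dec)
            if (enc == "" || dec == "") exit 1
            print dec - enc
        }
    '
}
```

A possible invocation:

    munge -n | ssh snode unmunge | munge_skew

For the output quoted above this gives 5 seconds, well inside the TTL of 300 shown in the same output.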



End of slurm-users Digest, Vol 40, Issue 4
******************************************