[slurm-users] Problem with slurmctld communication with slurmdbd

Bruno Santos bacmsantos at gmail.com
Wed Nov 29 06:40:17 MST 2017


I actually just managed to figure that one out.

The problem was that I had set AccountingStoragePass=magic in the
slurm.conf file. After re-reading the documentation, it seems this is only
needed if a separate munge instance controls the logins to the database,
which is not the case here.
Commenting that line out seems to have worked; however, I am now getting a
different error:

> Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port
> 6817 with slurmdbd.
> Nov 29 13:19:20 plantae slurmctld[29984]: error: slurm_persist_conn_open:
> Something happened with the receiving/processing of the persistent
> connection init message to localhost:6819: Initial RPC not DBD_INIT
> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process
> exited, code=exited, status=1/FAILURE
> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered failed
> state.
> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with result
> 'exit-code'.


My slurm.conf looks like this:

> # LOGGING AND ACCOUNTING
> AccountingStorageHost=localhost
> AccountingStorageLoc=slurm_db
> #AccountingStoragePass=magic
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageUser=slurm
> AccountingStoreJobComment=YES
> ClusterName=research
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=3
> SlurmdDebug=3


And the slurmdbd.conf like this:

> ArchiveEvents=yes
> ArchiveJobs=yes
> ArchiveResvs=yes
> ArchiveSteps=no
> #ArchiveTXN=no
> #ArchiveUsage=no
> # Authentication info
> AuthType=auth/munge
> AuthInfo=/var/run/munge/munge.socket.2
> #Database info
> # slurmDBD info
> DbdAddr=plantae
> DbdHost=plantae
> # Database info
> StorageType=accounting_storage/mysql
> StorageHost=localhost
> SlurmUser=slurm
> StoragePass=magic
> StorageUser=slurm
> StorageLoc=slurm_db



Thank you very much in advance.

Best,
Bruno


On 29 November 2017 at 13:28, Andy Riebs <andy.riebs at hpe.com> wrote:

> It looks like you don't have the munged daemon running.
>
>
> On 11/29/2017 08:01 AM, Bruno Santos wrote:
>
> Hi everyone,
>
> I have set up Slurm to use slurm_db and all was working fine. However, I
> had to change slurm.conf to play with user priority, and upon restarting,
> slurmctld fails with the messages below. It seems that it is somehow
> trying to use the MySQL password as a munge socket?
> Any idea how to solve it?
>
>
>> Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at port
>> 6817 with slurmdbd.
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up, restart
>> with --num-threads=10
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode failed:
>> Failed to access "magic": No such file or directory
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication: Socket
>> communication error
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurm_persist_conn_open:
>> failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurmdbd: Sending
>> PersistInit msg: Protocol authentication error
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: If munged is up, restart
>> with --num-threads=10
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: Munge encode failed:
>> Failed to access "magic": No such file or directory
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: authentication: Socket
>> communication error
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurm_persist_conn_open:
>> failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurmdbd: Sending
>> PersistInit msg: Protocol authentication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: If munged is up, restart
>> with --num-threads=10
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: Munge encode failed:
>> Failed to access "magic": No such file or directory
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: authentication: Socket
>> communication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurm_persist_conn_open:
>> failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurmdbd: Sending
>> PersistInit msg: Protocol authentication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: fatal: It appears you don't
>> have any association data from your database.  The priority/multifactor
>> plugin requires this information to run correctly.  Please check your
>> database connection and try again.
>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Main process
>> exited, code=exited, status=1/FAILURE
>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Unit entered
>> failed state.
>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Failed with result
>> 'exit-code'.