[slurm-users] Problem with slurmctld communication with slurmdbd

Barbara Krašovec barbara.krasovec at ijs.si
Wed Nov 29 07:06:57 MST 2017


Hello,

Does munge work?
Check whether decoding works locally:
munge -n | unmunge
Check whether decoding works remotely:
munge -n | ssh <somehost_in_cluster> unmunge

It looks as if the munge keys do not match.
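
If the remote decode fails, compare the key checksums on both hosts (a quick check, assuming the default key location /etc/munge/munge.key):
md5sum /etc/munge/munge.key
ssh <somehost_in_cluster> md5sum /etc/munge/munge.key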

See my comments inline.

> On 29 Nov 2017, at 14:40, Bruno Santos <bacmsantos at gmail.com> wrote:
> 
> I actually just managed to figure that one out.
> 
> The problem was that I had set AccountingStoragePass=magic in the slurm.conf file; after re-reading the documentation, it seems this is only needed when a separate munge instance controls logins to the database, which is not my case.
> Commenting that line out seems to have worked; however, I am now getting a different error:
> Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port 6817 with slurmdbd.
> Nov 29 13:19:20 plantae slurmctld[29984]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered failed state.
> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with result 'exit-code'.
> 
> My slurm.conf looks like this:
> # LOGGING AND ACCOUNTING
> AccountingStorageHost=localhost
> AccountingStorageLoc=slurm_db
> #AccountingStoragePass=magic
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageUser=slurm
> AccountingStoreJobComment=YES
> ClusterName=research
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=3
> SlurmdDebug=3

You only need:
AccountingStorageEnforce=associations,limits,qos
AccountingStorageHost=<hostname>
AccountingStorageType=accounting_storage/slurmdbd

You can remove AccountingStorageLoc and AccountingStorageUser.
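
Once slurmctld can reach slurmdbd again, also check that your cluster is registered in the database; the fatal error about missing association data further down suggests it may not be. For example (assuming ClusterName=research from your slurm.conf):
sacctmgr show cluster
sacctmgr add cluster research

The second command is only needed if "research" is not already listed.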


> 
> And the slurmdbd.conf looks like this:
> ArchiveEvents=yes
> ArchiveJobs=yes
> ArchiveResvs=yes
> ArchiveSteps=no
> #ArchiveTXN=no
> #ArchiveUsage=no
> # Authentication info
> AuthType=auth/munge
> AuthInfo=/var/run/munge/munge.socket.2
> #Database info
> # slurmDBD info
> DbdAddr=plantae
> DbdHost=plantae
> # Database info
> StorageType=accounting_storage/mysql
> StorageHost=localhost
> SlurmUser=slurm
> StoragePass=magic
> StorageUser=slurm
> StorageLoc=slurm_db
> 
> 
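Note that StoragePass here is the MySQL password and does belong in slurmdbd.conf. AccountingStoragePass in slurm.conf, by contrast, points at an alternate MUNGE socket, which is why slurmctld tried to open "magic" as a socket in your earlier logs. After editing slurmdbd.conf, restart slurmdbd before slurmctld (assuming the systemd units shown in your logs):
systemctl restart slurmdbd
systemctl restart slurmctld
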
> Thank you very much in advance.
> 
> Best,
> Bruno

Cheers,
Barbara

> 
> 
> On 29 November 2017 at 13:28, Andy Riebs <andy.riebs at hpe.com> wrote:
> It looks like you don't have the munged daemon running.
> 
> 
> On 11/29/2017 08:01 AM, Bruno Santos wrote:
>> Hi everyone,
>> 
>> I have set up Slurm to use slurm_db and everything was working fine. However, I had to change slurm.conf to play with user priority, and upon restarting, slurmctld fails with the messages below. It seems that it is somehow trying to use the MySQL password as a munge socket?
>> Any idea how to solve it?
>> 
>> Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at port 6817 with slurmdbd.
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up, restart with --num-threads=10
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode failed: Failed to access "magic": No such file or directory
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication: Socket communication error
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit msg: Protocol authentication error
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: If munged is up, restart with --num-threads=10
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: Munge encode failed: Failed to access "magic": No such file or directory
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: authentication: Socket communication error
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit msg: Protocol authentication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: If munged is up, restart with --num-threads=10
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: Munge encode failed: Failed to access "magic": No such file or directory
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: authentication: Socket communication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit msg: Protocol authentication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: fatal: It appears you don't have any association data from your database.  The priority/multifactor plugin requires this information to run correctly.  Please check your database connection and try again.
>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Unit entered failed state.
>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Failed with result 'exit-code'.
>> 
>> 
> 
> 
