[slurm-users] Problem with slurmctld communication with slurmdbd
Bruno Santos
bacmsantos at gmail.com
Wed Nov 29 07:19:46 MST 2017
Thank you Barbara,
Unfortunately, it does not seem to be a munge problem. Munge can
successfully authenticate with the nodes.
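One way to confirm the keys really do match is to compare the key's checksum across hosts; a sketch assuming the default key location, with placeholder node names:

```shell
# Compare the local munge key checksum with each node's copy; all digests
# must be identical. Key path is the default; node01/node02 are placeholders.
key=/etc/munge/munge.key
local_sum=$(sudo sha256sum "$key" | awk '{print $1}')
for host in node01 node02; do
    remote_sum=$(ssh "$host" "sudo sha256sum $key" | awk '{print $1}')
    if [ "$local_sum" = "$remote_sum" ]; then
        echo "$host: key matches"
    else
        echo "$host: KEY MISMATCH"
    fi
done
```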
I have increased the verbosity level and restarted slurmctld, and I am now
getting more information:
> Nov 29 14:08:16 plantae slurmctld[30340]: Registering slurmctld at port
>> 6817 with slurmdbd.
>
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open:
>> Something happened with the receiving/processing of the persistent
>> connection init message to localhost:6819: Initial RPC not DBD_INIT
>
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending
>> PersistInit msg: No error
>
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open:
>> Something happened with the receiving/processing of the persistent
>> connection init message to localhost:6819: Initial RPC not DBD_INIT
>
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending
>> PersistInit msg: No error
>
> Nov 29 14:08:16 plantae slurmctld[30340]: fatal: It appears you don't have
>> any association data from your database. The priority/multifactor plugin
>> requires this information to run correctly. Please check your database
>> connection and try again.
>
>
So the problem seems to be related to slurmdbd somehow?
I am a bit lost at this point, to be honest.
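The fatal message suggests slurmdbd has no cluster registered yet; a sketch of the checks it seems to call for, assuming the ClusterName=research from the slurm.conf below and the default slurmdbd port 6819 ("default" is a placeholder account name):

```shell
# Is slurmdbd actually up and listening on its port?
systemctl status slurmdbd
ss -tlnp | grep 6819

# Does the database know about any clusters? An empty list here would
# explain the "no association data" fatal error from slurmctld.
sacctmgr show cluster

# If "research" is missing, register it and create a basic association:
sacctmgr add cluster research
sacctmgr add account default Cluster=research
sacctmgr add user slurm Account=default
```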
Best,
Bruno
On 29 November 2017 at 14:06, Barbara Krašovec <barbara.krasovec at ijs.si>
wrote:
> Hello,
>
> does munge work?
> Check whether decoding works locally:
> munge -n | unmunge
> Check whether decoding works remotely:
> munge -n | ssh <somehost_in_cluster> unmunge
>
> It seems as if the munge keys do not match...
>
> See comments inline..
>
> On 29 Nov 2017, at 14:40, Bruno Santos <bacmsantos at gmail.com> wrote:
>
> I actually just managed to figure that one out.
>
> The problem was that I had set AccountingStoragePass=magic in the
> slurm.conf file; after re-reading the documentation, it seems this is
> only needed if a separate munge instance controls logins to the database,
> which is not my case.
> Commenting that line out seems to have worked; however, I am now getting
> a different error:
>
>> Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port
>> 6817 with slurmdbd.
>> Nov 29 13:19:20 plantae slurmctld[29984]: error: slurm_persist_conn_open:
>> Something happened with the receiving/processing of the persistent
>> connection init message to localhost:6819: Initial RPC not DBD_INIT
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process
>> exited, code=exited, status=1/FAILURE
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered
>> failed state.
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with result
>> 'exit-code'.
>
>
> My slurm.conf looks like this:
>
>> # LOGGING AND ACCOUNTING
>> AccountingStorageHost=localhost
>> AccountingStorageLoc=slurm_db
>> #AccountingStoragePass=magic
>> #AccountingStoragePort=
>> AccountingStorageType=accounting_storage/slurmdbd
>> AccountingStorageUser=slurm
>> AccountingStoreJobComment=YES
>> ClusterName=research
>> JobCompType=jobcomp/none
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/none
>> SlurmctldDebug=3
>> SlurmdDebug=3
>
>
> You only need:
> AccountingStorageEnforce=associations,limits,qos
> AccountingStorageHost=<hostname>
> AccountingStorageType=accounting_storage/slurmdbd
>
> You can remove AccountingStorageLoc and AccountingStorageUser.
>
>
>
> And the slurmdbd.conf looks like this:
>
>> ArchiveEvents=yes
>> ArchiveJobs=yes
>> ArchiveResvs=yes
>> ArchiveSteps=no
>> #ArchiveTXN=no
>> #ArchiveUsage=no
>> # Authentication info
>> AuthType=auth/munge
>> AuthInfo=/var/run/munge/munge.socket.2
>
> #Database info
>> # slurmDBD info
>> DbdAddr=plantae
>> DbdHost=plantae
>> # Database info
>> StorageType=accounting_storage/mysql
>> StorageHost=localhost
>> SlurmUser=slurm
>> StoragePass=magic
>> StorageUser=slurm
>> StorageLoc=slurm_db
>
>
>
> Thank you very much in advance.
>
> Best,
> Bruno
>
>
> Cheers,
> Barbara
>
>
>
> On 29 November 2017 at 13:28, Andy Riebs <andy.riebs at hpe.com> wrote:
>
>> It looks like you don't have the munged daemon running.
>>
>>
>> On 11/29/2017 08:01 AM, Bruno Santos wrote:
>>
>> Hi everyone,
>>
>> I had set up Slurm to use slurm_db and everything was working fine.
>> However, I had to change slurm.conf to play with user priority, and upon
>> restarting, slurmctld fails with the messages below. It seems that it is
>> somehow trying to use the MySQL password as a munge socket?
>> Any idea how to solve this?
>>
>>
>>> Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at port
>>> 6817 with slurmdbd.
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up,
>>> restart with --num-threads=10
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode failed:
>>> Failed to access "magic": No such file or directory
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication: Socket
>>> communication error
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error:
>>> slurm_persist_conn_open: failed to send persistent connection init message
>>> to localhost:6819
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurmdbd: Sending
>>> PersistInit msg: Protocol authentication error
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: If munged is up,
>>> restart with --num-threads=10
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: Munge encode failed:
>>> Failed to access "magic": No such file or directory
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: authentication: Socket
>>> communication error
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error:
>>> slurm_persist_conn_open: failed to send persistent connection init message
>>> to localhost:6819
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurmdbd: Sending
>>> PersistInit msg: Protocol authentication error
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: If munged is up,
>>> restart with --num-threads=10
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: Munge encode failed:
>>> Failed to access "magic": No such file or directory
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: authentication: Socket
>>> communication error
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error:
>>> slurm_persist_conn_open: failed to send persistent connection init message
>>> to localhost:6819
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurmdbd: Sending
>>> PersistInit msg: Protocol authentication error
>>> Nov 29 12:56:36 plantae slurmctld[29613]: fatal: It appears you don't
>>> have any association data from your database. The priority/multifactor
>>> plugin requires this information to run correctly. Please check your
>>> database connection and try again.
>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Main process
>>> exited, code=exited, status=1/FAILURE
>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Unit entered
>>> failed state.
>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Failed with
>>> result 'exit-code'.
>>
>>
>>
>>
>>
>>
>
>