[slurm-users] Problem with slurmctld communication with slurmdbd
Barbara Krašovec
barbara.krasovec at ijs.si
Wed Nov 29 07:46:34 MST 2017
Did you upgrade SLURM or is it a fresh install?
Are there any associations set? For instance, did you create the cluster with sacctmgr?
sacctmgr add cluster <name>
Is the mariadb/mysql server running, and is slurmdbd running? Is it working? Try a simple test, such as:
sacctmgr show user -s
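If that returns nothing, the association data slurmctld complains about is simply missing. A minimal sketch of populating it (the cluster name "research" is taken from your slurm.conf below; the account and user names are placeholders, adjust to your setup):
sacctmgr add cluster research
sacctmgr add account <account> cluster=research
sacctmgr add user <username> account=<account>
sacctmgr show associations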
If it was an upgrade, did you try running slurmdbd and slurmctld manually first:
slurmdbd -Dvvvvv
Then controller:
slurmctld -Dvvvvv
Which OS is that?
Is there a firewall/selinux/ACLs?
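A rough sketch of what to check (adjust for your distribution, these are only suggestions):
ss -tlnp | grep 6819     # is slurmdbd listening on its port?
getenforce               # selinux mode, if installed
iptables -L -n           # any rules blocking 6817/6819?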
Cheers,
Barbara
> On 29 Nov 2017, at 15:19, Bruno Santos <bacmsantos at gmail.com> wrote:
>
> Thank you Barbara,
>
> Unfortunately, it does not seem to be a munge problem. Munge can successfully authenticate with the nodes.
>
> I have increased the verbosity level and restarted the slurmctld and now I am getting more information about this:
> Nov 29 14:08:16 plantae slurmctld[30340]: Registering slurmctld at port 6817 with slurmdbd.
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending PersistInit msg: No error
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending PersistInit msg: No error
> Nov 29 14:08:16 plantae slurmctld[30340]: fatal: It appears you don't have any association data from your database. The priority/multifactor plugin requires this information to run correctly. Please check your database connection and try again.
>
> The problem somehow seems to be related to slurmdbd?
> I am a bit lost at this point, to be honest.
>
> Best,
> Bruno
>
> On 29 November 2017 at 14:06, Barbara Krašovec <barbara.krasovec at ijs.si <mailto:barbara.krasovec at ijs.si>> wrote:
> Hello,
>
> does munge work?
> Check whether decoding works locally:
> munge -n | unmunge
> Check whether decoding works remotely:
> munge -n | ssh <somehost_in_cluster> unmunge
>
> It seems as if the munge keys do not match...
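> A quick sketch to compare the keys (paths assume the default munge install, adjust if yours differs); the checksums must be identical on every node:
> md5sum /etc/munge/munge.key
> ssh <somehost_in_cluster> md5sum /etc/munge/munge.key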
>
> See comments inline..
>
>> On 29 Nov 2017, at 14:40, Bruno Santos <bacmsantos at gmail.com <mailto:bacmsantos at gmail.com>> wrote:
>>
>> I actually just managed to figure that one out.
>>
>> The problem was that I had set AccountingStoragePass=magic in the slurm.conf file. After re-reading the documentation, it seems this is only needed if a separate munge instance controls logins to the database, which I don't have.
>> Commenting that line out seems to have worked; however, I am now getting a different error:
>> Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port 6817 with slurmdbd.
>> Nov 29 13:19:20 plantae slurmctld[29984]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered failed state.
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with result 'exit-code'.
>>
>> My slurm.conf looks like this:
>> # LOGGING AND ACCOUNTING
>> AccountingStorageHost=localhost
>> AccountingStorageLoc=slurm_db
>> #AccountingStoragePass=magic
>> #AccountingStoragePort=
>> AccountingStorageType=accounting_storage/slurmdbd
>> AccountingStorageUser=slurm
>> AccountingStoreJobComment=YES
>> ClusterName=research
>> JobCompType=jobcomp/none
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/none
>> SlurmctldDebug=3
>> SlurmdDebug=3
>
> You only need:
> AccountingStorageEnforce=associations,limits,qos
> AccountingStorageHost=<hostname>
> AccountingStorageType=accounting_storage/slurmdbd
>
> You can remove AccountingStorageLoc and AccountingStorageUser.
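> After changing either file, restart slurmdbd first and then slurmctld, for example (assuming the usual systemd unit names, as seen in your logs):
> systemctl restart slurmdbd
> systemctl restart slurmctld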
>
>
>>
>> And the slurmdbd.conf looks like this:
>> ArchiveEvents=yes
>> ArchiveJobs=yes
>> ArchiveResvs=yes
>> ArchiveSteps=no
>> #ArchiveTXN=no
>> #ArchiveUsage=no
>> # Authentication info
>> AuthType=auth/munge
>> AuthInfo=/var/run/munge/munge.socket.2
>> #Database info
>> # slurmDBD info
>> DbdAddr=plantae
>> DbdHost=plantae
>> # Database info
>> StorageType=accounting_storage/mysql
>> StorageHost=localhost
>> SlurmUser=slurm
>> StoragePass=magic
>> StorageUser=slurm
>> StorageLoc=slurm_db
>>
>>
>> Thank you very much in advance.
>>
>> Best,
>> Bruno
>
> Cheers,
> Barbara
>
>>
>>
>> On 29 November 2017 at 13:28, Andy Riebs <andy.riebs at hpe.com <mailto:andy.riebs at hpe.com>> wrote:
>> It looks like you don't have the munged daemon running.
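>> A quick check (assuming munge is installed with its standard systemd unit):
>> systemctl status munge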
>>
>>
>> On 11/29/2017 08:01 AM, Bruno Santos wrote:
>>> Hi everyone,
>>>
>>> I have set up Slurm to use slurm_db and everything was working fine. However, I had to change slurm.conf to experiment with user priority, and upon restarting, slurmctld fails with the messages below. It seems that it is somehow trying to use the MySQL password as a munge socket?
>>> Any idea how to solve it?
>>>
>>> Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at port 6817 with slurmdbd.
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up, restart with --num-threads=10
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode failed: Failed to access "magic": No such file or directory
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication: Socket communication error
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit msg: Protocol authentication error
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: If munged is up, restart with --num-threads=10
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: Munge encode failed: Failed to access "magic": No such file or directory
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: authentication: Socket communication error
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit msg: Protocol authentication error
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: If munged is up, restart with --num-threads=10
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: Munge encode failed: Failed to access "magic": No such file or directory
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: authentication: Socket communication error
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit msg: Protocol authentication error
>>> Nov 29 12:56:36 plantae slurmctld[29613]: fatal: It appears you don't have any association data from your database. The priority/multifactor plugin requires this information to run correctly. Please check your database connection and try again.
>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Unit entered failed state.
>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Failed with result 'exit-code'.
>>>
>>>
>>
>>
>
>