[slurm-users] Problem with slurmctld communication with slurmdbd

Bruno Santos bacmsantos at gmail.com
Wed Nov 29 08:13:51 MST 2017


Hi Barbara,

This is a fresh install. I have installed Slurm from source on Debian
stretch and am now trying to set it up correctly.
MariaDB is running, but I am confused about the database configuration.
I followed a tutorial (I can no longer find it) that showed me how to
create the database and grant it to the slurm user in MySQL. I haven't really
done anything further than that, as every command I run returns the same errors:

root@plantae:~# sacctmgr show user -s
> sacctmgr: error: slurm_persist_conn_open: Something happened with the
> receiving/processing of the persistent connection init message to
> localhost:6819: Initial RPC not DBD_INIT
> sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
> sacctmgr: error: slurm_persist_conn_open: Something happened with the
> receiving/processing of the persistent connection init message to
> localhost:6819: Initial RPC not DBD_INIT
> sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
> sacctmgr: error: slurm_persist_conn_open: Something happened with the
> receiving/processing of the persistent connection init message to
> localhost:6819: Initial RPC not DBD_INIT
> sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
> sacctmgr: error: slurmdbd: DBD_GET_USERS failure: No error
>  Problem with query.
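
For reference, the database step described above (create the database and grant it to the slurm user) can be sketched roughly as follows. This is only an illustration, not the lost tutorial: the function name slurm_db_sql is hypothetical, and the defaults slurm_db, slurm and magic are taken from the StorageLoc, StorageUser and StoragePass values in the slurmdbd.conf quoted further down. It assumes a MariaDB version that accepts CREATE USER IF NOT EXISTS.

```shell
#!/bin/sh
# Print the SQL that creates the accounting database and grants it to the
# database user that slurmdbd connects as. Nothing is executed here; the
# statements are only written to stdout.
# Arguments (all optional): database name, user name, password.
slurm_db_sql() {
    db=${1:-slurm_db}
    user=${2:-slurm}
    pass=${3:-magic}
    cat <<EOF
CREATE DATABASE IF NOT EXISTS ${db};
CREATE USER IF NOT EXISTS '${user}'@'localhost' IDENTIFIED BY '${pass}';
GRANT ALL PRIVILEGES ON ${db}.* TO '${user}'@'localhost';
EOF
}

slurm_db_sql
```

The output could then be piped into the client, for example `slurm_db_sql | mysql -u root -p`, assuming the MariaDB root account accepts password logins on this host.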




On 29 November 2017 at 14:46, Barbara Krašovec <barbara.krasovec at ijs.si>
wrote:

> Did you upgrade SLURM or is it a fresh install?
>
> Are there any associations set? For instance, did you create the cluster
> with sacctmgr?
> sacctmgr add cluster <name>
>
> Is mariadb/mysql server running, is slurmdbd running? Is it working? Try a
> simple test, such as:
>
> sacctmgr show user -s
>
> If it was an upgrade, did you try running slurmdbd and slurmctld
> manually first:
>
> slurmdbd -Dvvvvv
>
> Then controller:
>
> slurmctld -Dvvvvv
>
> Which OS is that?
> Is there a firewall/selinux/ACLs?
>
> Cheers,
> Barbara
>
>
> On 29 Nov 2017, at 15:19, Bruno Santos <bacmsantos at gmail.com> wrote:
>
> Thank you Barbara,
>
> Unfortunately, it does not seem to be a munge problem. Munge can
> successfully authenticate with the nodes.
>
> I have increased the verbosity level and restarted the slurmctld and now I
> am getting more information about this:
>
>> Nov 29 14:08:16 plantae slurmctld[30340]: Registering slurmctld at port
>>> 6817 with slurmdbd.
>>
>> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open:
>>> Something happened with the receiving/processing of the persistent
>>> connection init message to localhost:6819: Initial RPC not DBD_INIT
>>
>> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending
>>> PersistInit msg: No error
>>
>> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open:
>>> Something happened with the receiving/processing of the persistent
>>> connection init message to localhost:6819: Initial RPC not DBD_INIT
>>
>> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending
>>> PersistInit msg: No error
>>
>> Nov 29 14:08:16 plantae slurmctld[30340]: fatal: It appears you don't
>>> have any association data from your database.  The priority/multifactor
>>> plugin requires this information to run correctly.  Please check your
>>> database connection and try again.
>>
>>
> The problem seems to somehow be related to slurmdbd?
> I am a bit lost at this point, to be honest.
>
> Best,
> Bruno
>
> On 29 November 2017 at 14:06, Barbara Krašovec <barbara.krasovec at ijs.si>
> wrote:
>
>> Hello,
>>
>> Does munge work?
>> Check whether decoding works locally:
>> munge -n | unmunge
>> Check whether decoding works remotely:
>> munge -n | ssh <somehost_in_cluster> unmunge
>>
>> It seems as if the munge keys do not match...
>>
>> See comments inline..
>>
>> On 29 Nov 2017, at 14:40, Bruno Santos <bacmsantos at gmail.com> wrote:
>>
>> I actually just managed to figure that one out.
>>
>> The problem was that I had set up AccountingStoragePass=magic in the
>> slurm.conf file. After re-reading the documentation, it seems this is
>> only needed if I have a different munge instance controlling the logins to
>> the database, which I don't.
>> Commenting that line out seems to have worked; however, I am now getting
>> a different error:
>>
>>> Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port
>>> 6817 with slurmdbd.
>>> Nov 29 13:19:20 plantae slurmctld[29984]: error:
>>> slurm_persist_conn_open: Something happened with the receiving/processing
>>> of the persistent connection init message to localhost:6819: Initial RPC
>>> not DBD_INIT
>>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process
>>> exited, code=exited, status=1/FAILURE
>>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered
>>> failed state.
>>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with
>>> result 'exit-code'.
>>
>>
>> My slurm.conf looks like this
>>
>>> # LOGGING AND ACCOUNTING
>>> AccountingStorageHost=localhost
>>> AccountingStorageLoc=slurm_db
>>> #AccountingStoragePass=magic
>>> #AccountingStoragePort=
>>> AccountingStorageType=accounting_storage/slurmdbd
>>> AccountingStorageUser=slurm
>>> AccountingStoreJobComment=YES
>>> ClusterName=research
>>> JobCompType=jobcomp/none
>>> JobAcctGatherFrequency=30
>>> JobAcctGatherType=jobacct_gather/none
>>> SlurmctldDebug=3
>>> SlurmdDebug=3
>>
>>
>> You only need:
>> AccountingStorageEnforce=associations,limits,qos
>> AccountingStorageHost=<hostname>
>> AccountingStorageType=accounting_storage/slurmdbd
>>
>> You can remove AccountingStorageLoc and AccountingStorageUser.
>>
>>
>>
>> And the slurmdbd.conf like this:
>>
>>> ArchiveEvents=yes
>>> ArchiveJobs=yes
>>> ArchiveResvs=yes
>>> ArchiveSteps=no
>>> #ArchiveTXN=no
>>> #ArchiveUsage=no
>>> # Authentication info
>>> AuthType=auth/munge
>>> AuthInfo=/var/run/munge/munge.socket.2
>>
>> #Database info
>>> # slurmDBD info
>>> DbdAddr=plantae
>>> DbdHost=plantae
>>> # Database info
>>> StorageType=accounting_storage/mysql
>>> StorageHost=localhost
>>> SlurmUser=slurm
>>> StoragePass=magic
>>> StorageUser=slurm
>>> StorageLoc=slurm_db
>>
>>
>>
>> Thank you very much in advance.
>>
>> Best,
>> Bruno
>>
>>
>> Cheers,
>> Barbara
>>
>>
>>
>> On 29 November 2017 at 13:28, Andy Riebs <andy.riebs at hpe.com> wrote:
>>
>>> It looks like you don't have the munged daemon running.
>>>
>>>
>>> On 11/29/2017 08:01 AM, Bruno Santos wrote:
>>>
>>> Hi everyone,
>>>
>>> I have set up slurm to use slurm_db and all was working fine. However, I
>>> had to change slurm.conf to play with user priority and, upon restarting,
>>> slurmctld fails with the messages below. It seems that it is
>>> somehow trying to use the mysql password as a munge socket?
>>> Any idea how to solve it?
>>>
>>>
>>>> Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at port
>>>> 6817 with slurmdbd.
>>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up,
>>>> restart with --num-threads=10
>>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode failed:
>>>> Failed to access "magic": No such file or directory
>>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication: Socket
>>>> communication error
>>>> Nov 29 12:56:32 plantae slurmctld[29613]: error:
>>>> slurm_persist_conn_open: failed to send persistent connection init message
>>>> to localhost:6819
>>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurmdbd: Sending
>>>> PersistInit msg: Protocol authentication error
>>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: If munged is up,
>>>> restart with --num-threads=10
>>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: Munge encode failed:
>>>> Failed to access "magic": No such file or directory
>>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: authentication: Socket
>>>> communication error
>>>> Nov 29 12:56:34 plantae slurmctld[29613]: error:
>>>> slurm_persist_conn_open: failed to send persistent connection init message
>>>> to localhost:6819
>>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurmdbd: Sending
>>>> PersistInit msg: Protocol authentication error
>>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: If munged is up,
>>>> restart with --num-threads=10
>>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: Munge encode failed:
>>>> Failed to access "magic": No such file or directory
>>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: authentication: Socket
>>>> communication error
>>>> Nov 29 12:56:36 plantae slurmctld[29613]: error:
>>>> slurm_persist_conn_open: failed to send persistent connection init message
>>>> to localhost:6819
>>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurmdbd: Sending
>>>> PersistInit msg: Protocol authentication error
>>>> Nov 29 12:56:36 plantae slurmctld[29613]: fatal: It appears you don't
>>>> have any association data from your database.  The priority/multifactor
>>>> plugin requires this information to run correctly.  Please check your
>>>> database connection and try again.
>>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Main process
>>>> exited, code=exited, status=1/FAILURE
>>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Unit entered
>>>> failed state.
>>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Failed with
>>>> result 'exit-code'.
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
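
The first failure in the thread above (Munge encode failed: Failed to access "magic") came from an active AccountingStoragePass line being treated as a munge socket path. A small check along the following lines might catch that before restarting slurmctld. It is a sketch under stated assumptions: the function name check_storage_pass is hypothetical, the key is matched case-sensitively, and trailing comments on the line are not handled.

```shell
#!/bin/sh
# Warn when slurm.conf has an uncommented AccountingStoragePass whose value
# is not an existing path. Returns 1 when such a line is found, 0 otherwise.
# Argument: path to the slurm.conf file to inspect.
check_storage_pass() {
    conf=$1
    # Extract the value of the last uncommented AccountingStoragePass= line.
    value=$(sed -n 's/^[[:space:]]*AccountingStoragePass=\(.*\)$/\1/p' "$conf" | tail -n 1)
    if [ -n "$value" ] && [ ! -e "$value" ]; then
        echo "AccountingStoragePass=$value is not an existing path" >&2
        return 1
    fi
    return 0
}
```

It would be called as `check_storage_pass /path/to/slurm.conf`, where the path depends on how Slurm was built and installed.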