[slurm-users] Problem with slurmctld communication with slurmdbd

Bruno Santos bacmsantos at gmail.com
Wed Nov 29 09:26:25 MST 2017


I managed to make some more progress on this. The problem seems to be that
the service was still linking to an older version of slurmdbd that I had
installed with apt. I have now, hopefully, fully removed the old version,
but when I try to start the service it gets killed somehow. Any
suggestions?

> [2017-11-29T16:15:16.778] debug3: Trying to load plugin
> /usr/local/lib/slurm/auth_munge.so
> [2017-11-29T16:15:16.778] debug:  Munge authentication plugin loaded
> [2017-11-29T16:15:16.778] debug3: Success.
> [2017-11-29T16:15:16.778] debug3: Trying to load plugin
> /usr/local/lib/slurm/accounting_storage_mysql.so
> [2017-11-29T16:15:16.780] debug2: mysql_connect() called for db slurm_db
> [2017-11-29T16:15:16.786] adding column federation after flags in table
> cluster_table
> [2017-11-29T16:15:16.786] adding column fed_id after federation in table
> cluster_table
> [2017-11-29T16:15:16.786] adding column fed_state after fed_id in table
> cluster_table
> [2017-11-29T16:15:16.786] adding column fed_weight after fed_state in
> table cluster_table
> [2017-11-29T16:15:16.786] debug:  Table cluster_table has changed.
> Updating...
> [2017-11-29T16:15:17.259] debug:  Table txn_table has changed.  Updating...
> [2017-11-29T16:15:17.781] debug:  Table tres_table has changed.
> Updating...
> [2017-11-29T16:15:18.325] debug:  Table acct_coord_table has changed.
> Updating...
> [2017-11-29T16:15:18.783] debug:  Table acct_table has changed.
> Updating...
> [2017-11-29T16:15:19.252] debug:  Table res_table has changed.  Updating...
> [2017-11-29T16:15:20.267] debug:  Table clus_res_table has changed.
> Updating...
> [2017-11-29T16:15:20.762] debug:  Table qos_table has changed.  Updating...
> [2017-11-29T16:15:21.272] debug:  Table user_table has changed.
> Updating...
> [2017-11-29T16:15:22.079] Accounting storage MYSQL plugin loaded
> [2017-11-29T16:15:22.080] debug3: Success.
> [2017-11-29T16:15:22.083] debug2: ArchiveDir        = /tmp
> [2017-11-29T16:15:22.083] debug2: ArchiveScript     = (null)
> [2017-11-29T16:15:22.083] debug2: AuthInfo          = (null)
> [2017-11-29T16:15:22.083] debug2: AuthType          = auth/munge
> [2017-11-29T16:15:22.083] debug2: CommitDelay       = 0
> [2017-11-29T16:15:22.083] debug2: DbdAddr           = 10.1.10.37
> [2017-11-29T16:15:22.083] debug2: DbdBackupHost     = (null)
> [2017-11-29T16:15:22.083] debug2: DbdHost           = plantae
> [2017-11-29T16:15:22.083] debug2: DbdPort           = 6819
> [2017-11-29T16:15:22.083] debug2: DebugFlags        = (null)
> [2017-11-29T16:15:22.083] debug2: DebugLevel        = 7
> [2017-11-29T16:15:22.083] debug2: DefaultQOS        = (null)
> [2017-11-29T16:15:22.083] debug2: LogFile           =
> /slurm/log/slurmdbd.log
> [2017-11-29T16:15:22.083] debug2: MessageTimeout    = 10
> [2017-11-29T16:15:22.083] debug2: PidFile           =
> /slurm/run/slurmdbd.pid
> [2017-11-29T16:15:22.083] debug2: PluginDir         = /usr/local/lib/slurm
> [2017-11-29T16:15:22.083] debug2: PrivateData       = none
> [2017-11-29T16:15:22.083] debug2: PurgeEventAfter   = NONE
> [2017-11-29T16:15:22.083] debug2: PurgeJobAfter     = NONE
> [2017-11-29T16:15:22.083] debug2: PurgeResvAfter    = NONE
> [2017-11-29T16:15:22.083] debug2: PurgeStepAfter    = NONE
> [2017-11-29T16:15:22.083] debug2: PurgeSuspendAfter = NONE
> [2017-11-29T16:15:22.083] debug2: PurgeTXNAfter = NONE
> [2017-11-29T16:15:22.083] debug2: PurgeUsageAfter = NONE
> [2017-11-29T16:15:22.083] debug2: SlurmUser         = slurm(64030)
> [2017-11-29T16:15:22.083] debug2: StorageBackupHost = (null)
> [2017-11-29T16:15:22.083] debug2: StorageHost       = localhost
> [2017-11-29T16:15:22.083] debug2: StorageLoc        = slurm_db
> [2017-11-29T16:15:22.083] debug2: StoragePort       = 3306
> [2017-11-29T16:15:22.083] debug2: StorageType       =
> accounting_storage/mysql
> [2017-11-29T16:15:22.083] debug2: StorageUser       = slurm
> [2017-11-29T16:15:22.083] debug2: TCPTimeout        = 2
> [2017-11-29T16:15:22.083] debug2: TrackWCKey        = 0
> [2017-11-29T16:15:22.083] debug2: TrackSlurmctldDown= 0
> [2017-11-29T16:15:22.083] debug2: acct_storage_p_get_connection: request
> new connection 1
> [2017-11-29T16:15:22.086] slurmdbd version 17.02.9 started
> [2017-11-29T16:15:22.086] debug2: running rollup at Wed Nov 29 16:15:22
> 2017
> [2017-11-29T16:15:22.086] debug2: Everything rolled up
> [2017-11-29T16:16:46.798] Terminate signal (SIGINT or SIGTERM) received
> [2017-11-29T16:16:46.798] debug:  rpc_mgr shutting down
> [2017-11-29T16:16:46.799] debug3: starting mysql cleaning up
> [2017-11-29T16:16:46.799] debug3: finished mysql cleaning up
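
For what it is worth, this is roughly how I am checking that systemd is
really starting the freshly built slurmdbd and not the leftover apt one
(the slurmdbd.conf path below assumes the default /usr/local prefix of a
source build, so treat it as a guess):

# Which slurmdbd is on PATH, and which version is it?
which slurmdbd
slurmdbd -V

# Any leftover Debian packages still installed?
dpkg -l | grep -i slurm

# What does the systemd unit actually execute, and which PID file does it watch?
systemctl cat slurmdbd.service
grep -i pidfile /usr/local/etc/slurmdbd.conf

One thing I am wondering is whether the unit's PIDFile and the PidFile in
slurmdbd.conf (/slurm/run/slurmdbd.pid above) disagree; if they do, systemd
would eventually give up waiting and kill the daemon, which would fit the
terminate signal arriving about a minute and a half after startup in the
log above.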




On 29 November 2017 at 15:13, Bruno Santos <bacmsantos at gmail.com> wrote:

> Hi Barbara,
>
> This is a fresh install. I have installed slurm from source on Debian
> stretch and am now trying to set it up correctly.
> MariaDB is running, but I am confused about the database configuration.
> I followed a tutorial (I can no longer find it) that showed me how to
> create the database and grant it to the slurm user in MySQL; what I
> remember running is included below the error output. I haven't really done
> anything further than that, as running anything returns the same errors:
>
> root at plantae:~# sacctmgr show user -s
>> sacctmgr: error: slurm_persist_conn_open: Something happened with the
>> receiving/processing of the persistent connection init message to
>> localhost:6819: Initial RPC not DBD_INIT
>> sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
>> sacctmgr: error: slurm_persist_conn_open: Something happened with the
>> receiving/processing of the persistent connection init message to
>> localhost:6819: Initial RPC not DBD_INIT
>> sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
>> sacctmgr: error: slurm_persist_conn_open: Something happened with the
>> receiving/processing of the persistent connection init message to
>> localhost:6819: Initial RPC not DBD_INIT
>> sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
>> sacctmgr: error: slurmdbd: DBD_GET_USERS failure: No error
>>  Problem with query.
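>
> The database setup itself was roughly the following, reconstructed from
> memory since I cannot find the tutorial again (the database, user and
> password match my slurmdbd.conf, but treat the exact statements as
> approximate):
>
> mysql -u root -p -e "CREATE DATABASE slurm_db;"
> mysql -u root -p -e "CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'magic';"
> mysql -u root -p -e "GRANT ALL ON slurm_db.* TO 'slurm'@'localhost';"
> mysql -u root -p -e "FLUSH PRIVILEGES;"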
>
>
>
>
> On 29 November 2017 at 14:46, Barbara Krašovec <barbara.krasovec at ijs.si>
> wrote:
>
>> Did you upgrade SLURM or is it a fresh install?
>>
>> Are there any associations set? For instance, did you create the cluster
>> with sacctmgr?
>> sacctmgr add cluster <name>
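>>
>> For example, something along these lines would create the association
>> data the controller needs (cluster, account and user names here are only
>> placeholders):
>>
>> sacctmgr add cluster mycluster
>> sacctmgr add account myaccount cluster=mycluster description=test organization=test
>> sacctmgr add user myuser account=myaccount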
>>
>> Is mariadb/mysql server running, is slurmdbd running? Is it working? Try
>> a simple test, such as:
>>
>> sacctmgr show user -s
>>
>> If it was an upgrade, did you try to run slurmdbd and slurmctld
>> manually first:
>>
>> slurmdbd -Dvvvvv
>>
>> Then controller:
>>
>> slurmctld -Dvvvvv
>>
>> Which OS is that?
>> Is there a firewall/selinux/ACLs?
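>>
>> If a firewall might be in the way, something like the following on the
>> dbd host would show whether anything is listening on port 6819 and
>> whether iptables is filtering it (commands assume a stock Debian setup):
>>
>> ss -tlnp | grep 6819
>> iptables -L -n -v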
>>
>> Cheers,
>> Barbara
>>
>>
>> On 29 Nov 2017, at 15:19, Bruno Santos <bacmsantos at gmail.com> wrote:
>>
>> Thank you Barbara,
>>
>> Unfortunately, it does not seem to be a munge problem. Munge can
>> successfully authenticate with the nodes.
>>
>> I have increased the verbosity level and restarted the slurmctld and now
>> I am getting more information about this:
>>
>>> Nov 29 14:08:16 plantae slurmctld[30340]: Registering slurmctld at port
>>>> 6817 with slurmdbd.
>>>
>>> Nov 29 14:08:16 plantae slurmctld[30340]: error:
>>>> slurm_persist_conn_open: Something happened with the receiving/processing
>>>> of the persistent connection init message to localhost:6819: Initial RPC
>>>> not DBD_INIT
>>>
>>> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending
>>>> PersistInit msg: No error
>>>
>>> Nov 29 14:08:16 plantae slurmctld[30340]: error:
>>>> slurm_persist_conn_open: Something happened with the receiving/processing
>>>> of the persistent connection init message to localhost:6819: Initial RPC
>>>> not DBD_INIT
>>>
>>> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending
>>>> PersistInit msg: No error
>>>
>>> Nov 29 14:08:16 plantae slurmctld[30340]: fatal: It appears you don't
>>>> have any association data from your database.  The priority/multifactor
>>>> plugin requires this information to run correctly.  Please check your
>>>> database connection and try again.
>>>
>>>
>> So the problem seems to be somehow related to slurmdbd?
>> I am a bit lost at this point, to be honest.
>>
>> Best,
>> Bruno
>>
>> On 29 November 2017 at 14:06, Barbara Krašovec <barbara.krasovec at ijs.si>
>> wrote:
>>
>>> Hello,
>>>
>>> does munge work?
>>> Try if decode works locally:
>>> munge -n | unmunge
>>> Try if decode works remotely:
>>> munge -n | ssh <somehost_in_cluster> unmunge
>>>
>>> It seems as if the munge keys do not match...
>>>
>>> See comments inline..
>>>
>>> On 29 Nov 2017, at 14:40, Bruno Santos <bacmsantos at gmail.com> wrote:
>>>
>>> I actually just managed to figure that one out.
>>>
>>> The problem was that I had set AccountingStoragePass=magic in the
>>> slurm.conf file; after re-reading the documentation, it seems this is
>>> only needed when a separate munge instance controls the logins to the
>>> database, which is not my case (see the aside after the error output
>>> below). Commenting that line out seems to have worked; however, I am
>>> now getting a different error:
>>>
>>>> Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port
>>>> 6817 with slurmdbd.
>>>> Nov 29 13:19:20 plantae slurmctld[29984]: error:
>>>> slurm_persist_conn_open: Something happened with the receiving/processing
>>>> of the persistent connection init message to localhost:6819: Initial RPC
>>>> not DBD_INIT
>>>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process
>>>> exited, code=exited, status=1/FAILURE
>>>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered
>>>> failed state.
>>>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with
>>>> result 'exit-code'.
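>>>
>>> As an aside, my understanding from the slurm.conf documentation is that
>>> AccountingStoragePass is only meaningful when a dedicated second munge
>>> daemon handles the dbd traffic, in which case it holds the path to that
>>> daemon's socket rather than a password, e.g. something like (the socket
>>> path here is just an invented example):
>>>
>>> AccountingStoragePass=/var/run/munge/munge_dbd.socket.2
>>>
>>> That would also explain the "Failed to access "magic"" errors quoted
>>> further down, since slurmctld was apparently treating the value as a
>>> socket path.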
>>>
>>>
>>> My slurm.conf looks like this:
>>>
>>>> # LOGGING AND ACCOUNTING
>>>> AccountingStorageHost=localhost
>>>> AccountingStorageLoc=slurm_db
>>>> #AccountingStoragePass=magic
>>>> #AccountingStoragePort=
>>>> AccountingStorageType=accounting_storage/slurmdbd
>>>> AccountingStorageUser=slurm
>>>> AccountingStoreJobComment=YES
>>>> ClusterName=research
>>>> JobCompType=jobcomp/none
>>>> JobAcctGatherFrequency=30
>>>> JobAcctGatherType=jobacct_gather/none
>>>> SlurmctldDebug=3
>>>> SlurmdDebug=3
>>>
>>>
>>> You only need:
>>> AccountingStorageEnforce=associations,limits,qos
>>> AccountingStorageHost=<hostname>
>>> AccountingStorageType=accounting_storage/slurmdbd
>>>
>>> You can remove AccountingStorageLoc and AccountingStorageUser.
>>>
>>>
>>>
>>> And the slurmdbd.conf like this:
>>>
>>>> ArchiveEvents=yes
>>>> ArchiveJobs=yes
>>>> ArchiveResvs=yes
>>>> ArchiveSteps=no
>>>> #ArchiveTXN=no
>>>> #ArchiveUsage=no
>>>> # Authentication info
>>>> AuthType=auth/munge
>>>> AuthInfo=/var/run/munge/munge.socket.2
>>>
>>> #Database info
>>>> # slurmDBD info
>>>> DbdAddr=plantae
>>>> DbdHost=plantae
>>>> # Database info
>>>> StorageType=accounting_storage/mysql
>>>> StorageHost=localhost
>>>> SlurmUser=slurm
>>>> StoragePass=magic
>>>> StorageUser=slurm
>>>> StorageLoc=slurm_db
>>>
>>>
>>>
>>> Thank you very much in advance.
>>>
>>> Best,
>>> Bruno
>>>
>>>
>>> Cheers,
>>> Barbara
>>>
>>>
>>>
>>> On 29 November 2017 at 13:28, Andy Riebs <andy.riebs at hpe.com> wrote:
>>>
>>>> It looks like you don't have the munged daemon running.
>>>>
>>>>
>>>> On 11/29/2017 08:01 AM, Bruno Santos wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> I have set up slurm to use slurm_db and everything was working fine.
>>>> However, I had to change slurm.conf to play with user priority, and
>>>> upon restarting, slurmctld fails with the messages below. It seems
>>>> that it is somehow trying to use the mysql password as a munge socket?
>>>> Any idea how to solve it?
>>>>
>>>>
>>>>> Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at
>>>>> port 6817 with slurmdbd.
>>>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up,
>>>>> restart with --num-threads=10
>>>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode failed:
>>>>> Failed to access "magic": No such file or directory
>>>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication:
>>>>> Socket communication error
>>>>> Nov 29 12:56:32 plantae slurmctld[29613]: error:
>>>>> slurm_persist_conn_open: failed to send persistent connection init message
>>>>> to localhost:6819
>>>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurmdbd: Sending
>>>>> PersistInit msg: Protocol authentication error
>>>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: If munged is up,
>>>>> restart with --num-threads=10
>>>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: Munge encode failed:
>>>>> Failed to access "magic": No such file or directory
>>>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: authentication:
>>>>> Socket communication error
>>>>> Nov 29 12:56:34 plantae slurmctld[29613]: error:
>>>>> slurm_persist_conn_open: failed to send persistent connection init message
>>>>> to localhost:6819
>>>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurmdbd: Sending
>>>>> PersistInit msg: Protocol authentication error
>>>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: If munged is up,
>>>>> restart with --num-threads=10
>>>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: Munge encode failed:
>>>>> Failed to access "magic": No such file or directory
>>>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: authentication:
>>>>> Socket communication error
>>>>> Nov 29 12:56:36 plantae slurmctld[29613]: error:
>>>>> slurm_persist_conn_open: failed to send persistent connection init message
>>>>> to localhost:6819
>>>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurmdbd: Sending
>>>>> PersistInit msg: Protocol authentication error
>>>>> Nov 29 12:56:36 plantae slurmctld[29613]: fatal: It appears you don't
>>>>> have any association data from your database.  The priority/multifactor
>>>>> plugin requires this information to run correctly.  Please check your
>>>>> database connection and try again.
>>>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Main process
>>>>> exited, code=exited, status=1/FAILURE
>>>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Unit entered
>>>>> failed state.
>>>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Failed with
>>>>> result 'exit-code'.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>

