[slurm-users] DBD Reset

Reed Dier reed.dier at focusvq.com
Thu Jun 16 00:10:17 UTC 2022


Hoping this is an easy answer.

My mysql instance somehow corrupted itself, and I’m having to purge and start over.
This is ok, because the data in there isn’t too valuable, and we aren’t making use of associations or anything like that yet (no AccountingStorageEnforce).

That said, I’ve decided to put the dbd’s mysql instance on my main database server, rather than in a small VM alongside the dbd.
Jobs are still submitting fine, and after adding the cluster back with `sacctmgr create cluster $cluster`, the log firehose seems to have stopped.
The main issue I’m seeing now is in the dbd logs:

> [2022-06-15T19:40:43.064] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
> [2022-06-15T19:40:43.065] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
> [2022-06-15T19:45:39.827] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
> [2022-06-15T19:48:01.038] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
> [2022-06-15T19:48:01.039] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
> [2022-06-15T19:48:38.104] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
> [2022-06-15T19:50:39.290] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
> [2022-06-15T19:55:39.769] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port

And if I run:
> $ sacctmgr show cluster
>     Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
>  ---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
>    $cluster                            0     0         1                                                                                           normal

I can see that ControlHost is empty and that ControlPort and RPC are both 0.
So I’m not sure what I need to do to effectively reset my dbd.
Also, $cluster in sacctmgr matches ClusterName=$cluster in my slurm.conf.
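For what it’s worth, that check was just something along these lines (command and config path here are illustrative, not copied verbatim):

> $ grep -i clustername /etc/slurm/slurm.conf
> ClusterName=$cluster
> $ sacctmgr -n show cluster format=cluster
>    $cluster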

The only thing that has changed is the StorageHost in slurmdbd.conf; on the new database server I created the database and the user, and granted ALL on slurm_acct_db.*.
I’ve also verified that the dbd has created its tables and that I can connect from the dbd host with the correct credentials.
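For reference, the relevant slurmdbd.conf storage settings and the statements I ran on the new server were roughly these (the user name, host, and password shown are placeholders, not my exact values):

> # slurmdbd.conf (relevant lines; user/pass/hostname are placeholders)
> StorageType=accounting_storage/mysql
> StorageHost=<new database server>
> StorageUser=slurm
> StoragePass=...
> StorageLoc=slurm_acct_db

> -- placeholder user/host/password, not my real values
> CREATE DATABASE slurm_acct_db;
> CREATE USER 'slurm'@'%' IDENTIFIED BY '...';
> GRANT ALL ON slurm_acct_db.* TO 'slurm'@'%';
> FLUSH PRIVILEGES;

And the tables the dbd has created: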

> mysql> show tables;
> +----------------------------------+
> | Tables_in_slurm_acct_db          |
> +----------------------------------+
> | acct_coord_table                 |
> | acct_table                       |
> | $cluster_assoc_table             |
> | $cluster_assoc_usage_day_table   |
> | $cluster_assoc_usage_hour_table  |
> | $cluster_assoc_usage_month_table |
> | $cluster_event_table             |
> | $cluster_job_table               |
> | $cluster_last_ran_table          |
> | $cluster_resv_table              |
> | $cluster_step_table              |
> | $cluster_suspend_table           |
> | $cluster_usage_day_table         |
> | $cluster_usage_hour_table        |
> | $cluster_usage_month_table       |
> | $cluster_wckey_table             |
> | $cluster_wckey_usage_day_table   |
> | $cluster_wckey_usage_hour_table  |
> | $cluster_wckey_usage_month_table |
> | clus_res_table                   |
> | cluster_table                    |
> | convert_version_table            |
> | federation_table                 |
> | qos_table                        |
> | res_table                        |
> | table_defs_table                 |
> | tres_table                       |
> | txn_table                        |
> | user_table                       |
> +----------------------------------+
> 29 rows in set (0.01 sec)


Any tips are appreciated.

Slurm 21.08.7 on Ubuntu 20.04.
Slurmdbd and slurmctld(1) run on one host; slurmctld(2) runs on another host and is currently the primary.

Thanks,
Reed