[slurm-users] DBD Reset
Ryan Novosielski
novosirj at rutgers.edu
Thu Jun 16 00:50:00 UTC 2022
Apologies for not having more concrete information on hand as I reply, but I figured a quick hint might be better than nothing.
Have a look at how the various daemons communicate with one another. This sounds to me like a firewall issue between the slurmctld host and wherever slurmdbd is running now, in one direction or the other. The “sacctmgr show cluster” output is the giveaway: those fields are populated dynamically, not read from a config file.
I ran into this exact thing years ago, but can’t remember where the offending firewall was.
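As a rough way to test the firewall theory, something like the following could be run on each host. It is only a sketch: the host names in the comments are hypothetical, and the ports assume the Slurm defaults (6817 for SlurmctldPort, 6819 for DbdPort), so adjust them if slurm.conf or slurmdbd.conf says otherwise.

```shell
#!/usr/bin/env bash
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Relies on bash's built-in /dev/tcp redirection plus coreutils `timeout`,
# so it works on hosts where nc or telnet are not installed.
port_open() {
    local host=$1 port=$2
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Print a one-line verdict for a host/port pair.
report() {
    if port_open "$1" "$2"; then
        echo "$1:$2 reachable"
    else
        echo "$1:$2 NOT reachable"
    fi
}

# Hypothetical host names; substitute your own.
#   From each slurmctld host:  report dbd-host 6819
#   From the slurmdbd host:    report ctld-host 6817
```

Checking both directions matters here, because slurmdbd also opens connections back to the controller using the host and port the controller registered with.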
On Jun 15, 2022, at 20:12, Reed Dier <reed.dier at focusvq.com> wrote:
Hoping this is an easy answer.
My MySQL instance somehow corrupted itself, and I’m having to purge it and start over.
This is OK, because the data in there isn’t too valuable, and we aren’t making use of associations or anything like that yet (no AccountingStorageEnforce).
That said, I’ve decided to put the dbd’s MySQL instance on my main database server, rather than in a small VM alongside the dbd.
Jobs are still submitting alright, and after adding the cluster back with `sacctmgr create cluster $cluster` it seems to have stopped the log firehose.
The issue I’m mainly seeing now is in the dbd logs:
[2022-06-15T19:40:43.064] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:40:43.065] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:45:39.827] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:48:01.038] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:48:01.039] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:48:38.104] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:50:39.290] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:55:39.769] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
And if I run
$ sacctmgr show cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
  $cluster                            0     0         1                                                                                           normal
I can see the ControlHost, ControlPort, and RPC are all missing.
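The same check can be scripted against sacctmgr's parsable output. This is a minimal sketch: `unregistered_clusters` is a hypothetical helper name, and it assumes the pipe-delimited, header-less format produced by the `-n` and `-P` options.

```shell
# Print the names of clusters that have no controller registered, i.e.
# whose ControlHost is empty or whose ControlPort is 0.
# Expects one pipe-delimited line per cluster on stdin, in the order
# Cluster|ControlHost|ControlPort|RPC, with no header line.
unregistered_clusters() {
    awk -F'|' '$2 == "" || $3 + 0 == 0 { print $1 }'
}
```

Usage would look like `sacctmgr -nP show cluster format=Cluster,ControlHost,ControlPort,RPC | unregistered_clusters`. An empty result means every cluster has a registered controller.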
So I’m not sure what I need to do to figure out how to effectively reset my dbd.
Also, $cluster in sacctmgr matches ClusterName=$cluster in my slurm.conf.
The only thing that has changed is the StorageHost in the dbd conf. On the new database server, I created the database and the user, and granted that user all privileges on slurm_acct_db.*.
And I’ve verified that it has made tables, and that I can connect from the host with the correct credentials.
mysql> show tables;
+----------------------------------+
| Tables_in_slurm_acct_db          |
+----------------------------------+
| acct_coord_table                 |
| acct_table                       |
| $cluster_assoc_table             |
| $cluster_assoc_usage_day_table   |
| $cluster_assoc_usage_hour_table  |
| $cluster_assoc_usage_month_table |
| $cluster_event_table             |
| $cluster_job_table               |
| $cluster_last_ran_table          |
| $cluster_resv_table              |
| $cluster_step_table              |
| $cluster_suspend_table           |
| $cluster_usage_day_table         |
| $cluster_usage_hour_table        |
| $cluster_usage_month_table       |
| $cluster_wckey_table             |
| $cluster_wckey_usage_day_table   |
| $cluster_wckey_usage_hour_table  |
| $cluster_wckey_usage_month_table |
| clus_res_table                   |
| cluster_table                    |
| convert_version_table            |
| federation_table                 |
| qos_table                        |
| res_table                        |
| table_defs_table                 |
| tres_table                       |
| txn_table                        |
| user_table                       |
+----------------------------------+
29 rows in set (0.01 sec)
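For reference, the database, user, and grant described above could be set up roughly as follows. This is a sketch rather than the exact statements used here: `slurm_db_sql` is a hypothetical helper, the user name, host, and password are placeholders, and `CREATE USER IF NOT EXISTS` assumes a reasonably recent MySQL or MariaDB.

```shell
# Emit the SQL that creates the accounting database and a user allowed to
# use it. Arguments: user name, host the dbd connects FROM, password.
# Values are interpolated as-is, so avoid single quotes in them.
slurm_db_sql() {
    local user=$1 host=$2 pass=$3
    cat <<EOF
CREATE DATABASE IF NOT EXISTS slurm_acct_db;
CREATE USER IF NOT EXISTS '${user}'@'${host}' IDENTIFIED BY '${pass}';
GRANT ALL ON slurm_acct_db.* TO '${user}'@'${host}';
EOF
}
```

It could be piped into the client on the database server, for example `slurm_db_sql slurm dbd-host.example.com 'secret' | mysql -u root -p`. With the database on a separate server, the host part of the grant has to match the machine slurmdbd connects from, not localhost.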
Any tips are appreciated.
This is Slurm 21.08.7 on Ubuntu 20.04.
slurmdbd and slurmctld(1) are running on one host; slurmctld(2), which is the primary, is running on another.
Thanks,
Reed