[slurm-users] DBD Reset
Ryan Novosielski
novosirj at rutgers.edu
Thu Jun 16 02:27:15 UTC 2022
It very much rang a bell!
I think there is also an scontrol command you can use to show the actual running config (probably “scontrol show config”); it will include the defaults for anything you don’t have specified in the config file.
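Something like this should show the accounting-related values actually in effect (just a rough sketch; the grep pattern is only an example):
$ scontrol show config | grep -i accounting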
Sent from my iPhone
On Jun 15, 2022, at 21:35, Reed Dier <reed.dier at focusvq.com> wrote:
Well, you nailed it.
Honestly a little surprised it was working to begin with.
In the DBD conf:
-#DbdPort=7031
+DbdPort=7031
And then in the slurm.conf:
-#AccountingStoragePort=3306
+AccountingStoragePort=7031
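For anyone following along, picking up the change is roughly a matter of restarting the daemons and checking the new port (a sketch, assuming the stock systemd unit names and that 7031 is the port chosen above):
$ sudo systemctl restart slurmdbd    # on the dbd host, picks up DbdPort
$ sudo systemctl restart slurmctld   # on the ctld host(s), picks up AccountingStoragePort
$ ss -tlnp | grep 7031               # confirm slurmdbd is listening on the new port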
I’m not sure how my slurm.conf showed the 3306 mysql port commented out.
I did confirm that slurmdbd was listening on 6819 before, so I assumed the default would be 6819 on both the dbd side and the “client” (ctld or otherwise) side, but somehow that wasn’t the case?
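For reference, the man pages describe both sides defaulting to 6819 when the ports are left unset, i.e. roughly this (a sketch, not the actual configs here):
# slurmdbd.conf: DbdPort unset -> slurmdbd listens on 6819
#DbdPort=6819
# slurm.conf: AccountingStoragePort unset -> clients also assume 6819
#AccountingStoragePort=6819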
Either way, I do feel like things are getting back to the right state.
So thank you so much for pointing me in the correct direction.
Thanks,
Reed
On Jun 15, 2022, at 7:50 PM, Ryan Novosielski <novosirj at rutgers.edu<mailto:novosirj at rutgers.edu>> wrote:
Apologies for not having more concrete information available when I’m replying to you, but I figured maybe having a fast hint might be better.
Have a look at how the various daemons communicate with one another. This sounds to me like a firewall issue between the slurmctld and wherever the slurmdbd is running right now, or vice versa, or something like that. The “sacctmgr show cluster” output is a giveaway: that information is populated dynamically when the ctld registers with the dbd, not pulled from a config file.
I ran into this exact thing years ago, but can’t remember where exactly the firewall was the problem.
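If it is a firewall, a quick connectivity check from the ctld host toward the dbd (and the other direction) should show it; something like this, with placeholder hostname and port:
$ nc -vz dbd-host 6819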
Sent from my iPhone
On Jun 15, 2022, at 20:12, Reed Dier <reed.dier at focusvq.com<mailto:reed.dier at focusvq.com>> wrote:
Hoping this is an easy answer.
My mysql instance somehow corrupted itself, and I’m having to purge and start over.
This is ok, because the data in there isn’t too valuable, and we aren’t making use of associations or anything like that yet (no AccountingStorageEnforce).
That said, I’ve decided to put the dbd’s mysql instance on my main database server, rather than in a small vm alongside the dbd.
Jobs are still submitting alright, and after adding the cluster back with `sacctmgr create cluster $cluster` it seems to have stopped the log firehose.
The issue I’m mainly seeing now is in the dbd logs:
[2022-06-15T19:40:43.064] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:40:43.065] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:45:39.827] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:48:01.038] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:48:01.039] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:48:38.104] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:50:39.290] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:55:39.769] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
And if I run:
$ sacctmgr show cluster
Cluster ControlHost ControlPort RPC Share GrpJobs GrpTRES GrpSubmit MaxJobs MaxTRES MaxSubmit MaxWall QOS Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
$cluster 0 0 1 normal
I can see the ControlHost, ControlPort, and RPC are all missing.
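If it helps, the output can be narrowed to just those fields with sacctmgr’s format option (a sketch):
$ sacctmgr show cluster format=Cluster,ControlHost,ControlPort,RPC
On a healthy setup those columns get filled in when the slurmctld registers with the dbd.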
So I’m not sure what I need to do to effectively reset my dbd.
Also, $cluster in sacctmgr matches ClusterName=$cluster in my slurm.conf.
The only thing that has changed is the StorageHost in the dbd conf; I created the database and the user, and granted all on slurm_acct_db.* on the new database server.
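The database-side setup was essentially the usual slurm_acct_db bootstrap, i.e. roughly this (a sketch; the dbd hostname and password are placeholders):
mysql> CREATE DATABASE slurm_acct_db;
mysql> CREATE USER 'slurm'@'dbd-host' IDENTIFIED BY 'some_password';
mysql> GRANT ALL ON slurm_acct_db.* TO 'slurm'@'dbd-host';
mysql> FLUSH PRIVILEGES;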
And I’ve verified that it has made tables, and that I can connect from the host with the correct credentials.
mysql> show tables;
+----------------------------------+
| Tables_in_slurm_acct_db |
+----------------------------------+
| acct_coord_table |
| acct_table |
| $cluster_assoc_table |
| $cluster_assoc_usage_day_table |
| $cluster_assoc_usage_hour_table |
| $cluster_assoc_usage_month_table |
| $cluster_event_table |
| $cluster_job_table |
| $cluster_last_ran_table |
| $cluster_resv_table |
| $cluster_step_table |
| $cluster_suspend_table |
| $cluster_usage_day_table |
| $cluster_usage_hour_table |
| $cluster_usage_month_table |
| $cluster_wckey_table |
| $cluster_wckey_usage_day_table |
| $cluster_wckey_usage_hour_table |
| $cluster_wckey_usage_month_table |
| clus_res_table |
| cluster_table |
| convert_version_table |
| federation_table |
| qos_table |
| res_table |
| table_defs_table |
| tres_table |
| txn_table |
| user_table |
+----------------------------------+
29 rows in set (0.01 sec)
Any tips are appreciated.
Slurm 21.08.7 on Ubuntu 20.04.
Slurmdbd and slurmctld(1) are running on one host; slurmctld(2), which is the primary, is running on another host.
Thanks,
Reed