[slurm-users] Slurmctl seg faulting
Luke Sudbery
l.r.sudbery at bham.ac.uk
Thu Jul 7 11:28:08 UTC 2022
Hello,
After a restart this morning*, slurmctld starts but then fails before it is fully up and running:
slurmctld: debug2: user user1 default acct is account1
slurmctld: debug2: assoc 68939(account1, user2) has direct parent of 26369(account1, (null))
slurmctld: debug2: user user2 default acct is account1
slurmctld: debug2: assoc 26396(account1, user3) has direct parent of 26369(account1, (null))
slurmctld: debug2: user user3 default acct is account1
slurmctld: debug2: assoc 26390(account1, user4) has direct parent of 26369(account1, (null))
slurmctld: debug2: user user4 default acct is account1
slurmctld: debug2: assoc 26342(account2, (null)) has direct parent of 3(root, (null))
slurmctld: debug2: assoc 76299(account2, rq-worker) has direct parent of 26342(account2, (null))
Segmentation fault (core dumped)
[root@bb-er-slurm01 state]#
This gives us very little to go on.
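To narrow this down, one option is to pull the last association that slurmctld logged before the crash. The following is only a rough sketch: the function name last_assoc_before_crash is made up for illustration, and it assumes the debug2 output has been captured to a string in exactly the format shown above. The last association logged is not necessarily the one that triggers the fault (the crash may happen while processing the next record), so treat the result as a starting point only.

```python
import re

# Matches debug2 lines of the form shown above, e.g.
#   slurmctld: debug2: assoc 76299(account2, rq-worker) has direct parent of 26342(account2, (null))
ASSOC_RE = re.compile(
    r"assoc (\d+)\(([^,]+), (.+?)\) has direct parent of (\d+)\("
)


def last_assoc_before_crash(log_text):
    """Return (assoc_id, account, user, parent_id) for the last association
    logged before a 'Segmentation fault' line, or None if none was seen.

    Hypothetical helper: it only parses captured output, it does not talk
    to Slurm in any way.
    """
    last = None
    for line in log_text.splitlines():
        if "Segmentation fault" in line:
            break  # stop at the crash; anything after it is shell output
        match = ASSOC_RE.search(line)
        if match:
            last = (
                int(match.group(1)),  # association id
                match.group(2),       # account name
                match.group(3),       # user name, or "(null)" for an account-level assoc
                int(match.group(4)),  # parent association id
            )
    return last
```

Fed the output above, this would point at association 76299 (account2, rq-worker), which at least gives one record to inspect in the database.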
Earlier, shortly before the segfault, there was a long list of errors like these:
slurmctld: error: assoc 62346 doesn't have access to it's default qos 'qos1'
slurmctld: error: assoc 61271 doesn't have access to it's default qos 'qos1'
slurmctld: error: assoc 61221 doesn't have access to it's default qos 'qos1'
slurmctld: error: assoc 61215 doesn't have access to it's default qos 'qos1'
We have cleared those by setting the correct default QOS with sacctmgr, but the controller is still segfaulting.
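For anyone wanting to check the same thing, the affected association ids can be collected from the captured output with something like the sketch below. The helper name assocs_missing_default_qos is hypothetical, and the pattern assumes the error text is exactly as quoted above (including the "it's" spelling in the message).

```python
import re

# Matches the error lines quoted above, e.g.
#   slurmctld: error: assoc 62346 doesn't have access to it's default qos 'qos1'
QOS_ERR_RE = re.compile(
    r"error: assoc (\d+) doesn't have access to it's default qos '([^']+)'"
)


def assocs_missing_default_qos(log_text):
    """Group the reported association ids by the QOS name in the message.

    Hypothetical helper: returns a dict mapping each QOS name to a sorted
    list of association ids, with duplicates removed.
    """
    by_qos = {}
    for match in QOS_ERR_RE.finditer(log_text):
        assoc_id = int(match.group(1))
        qos_name = match.group(2)
        by_qos.setdefault(qos_name, set()).add(assoc_id)
    return {qos: sorted(ids) for qos, ids in by_qos.items()}
```

The resulting ids can then be compared against the association table to confirm that none were missed when the default QOS was corrected.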
*The restart was initiated by our HA cluster (pacemaker/corosync), and at one point two copies of slurmdbd may have been running. The second copy appeared to detect this and back off, but we are not sure:
Jul 07 09:06:47 bb-aw-slurm01.bear.cluster systemd[1]: Started Cluster Controlled slurmdbd.
Jul 07 09:06:47 bb-aw-slurm01.bear.cluster slurmdbd[212307]: debug: Log file re-opened
Jul 07 09:06:47 bb-aw-slurm01.bear.cluster slurmdbd[212307]: pidfile not locked, assuming no running daemon
Jul 07 09:06:47 bb-aw-slurm01.bear.cluster slurmdbd[212307]: debug: auth/munge: init: Munge authentication plugin loaded
Jul 07 09:06:52 bb-aw-slurm01.bear.cluster slurmdbd[212307]: accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 10.3.32-MariaDB
Jul 07 09:06:52 bb-aw-slurm01.bear.cluster slurmdbd[212307]: error: _mysql_query_internal: deadlock detected attempt 1/10: 1213 WSREP replication failed. Check your wsrep connection state and retry the query.
Jul 07 09:06:52 bb-aw-slurm01.bear.cluster slurmdbd[212307]: error: mysql_query failed: 1047 WSREP has not yet prepared node for application use
create table if not exists table_defs_table (creation_time int unsigned not null, mod_time int unsigned default 0 not null, table_name text not null,
Jul 07 09:06:52 bb-aw-slurm01.bear.cluster slurmdbd[212307]: accounting_storage/as_mysql: init: Accounting storage MYSQL plugin failed
Jul 07 09:06:52 bb-aw-slurm01.bear.cluster slurmdbd[212307]: error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
Jul 07 09:06:52 bb-aw-slurm01.bear.cluster slurmdbd[212307]: error: cannot create accounting_storage context for accounting_storage/mysql
Jul 07 09:06:52 bb-aw-slurm01.bear.cluster systemd[1]: slurmdbd.service: main process exited, code=exited, status=1/FAILURE
Jul 07 09:06:52 bb-aw-slurm01.bear.cluster systemd[1]: Unit slurmdbd.service entered failed state.
Jul 07 09:06:52 bb-aw-slurm01.bear.cluster systemd[1]: slurmdbd.service failed.
Either way, we suspect database corruption may be the problem, but the slurmdbd logs give us nothing to go on.
So, our questions:
1. How can we troubleshoot further? I am currently attempting to rebuild Slurm with debugging symbols (https://slurm.schedmd.com/faq.html#debug), as gdb is not giving any useful information at the moment.
2. What are the implications of restoring the database? How does Slurm reconcile the database and the state directory?
Many thanks,
Luke
--
Luke Sudbery
Principal Engineer (HPC and Storage).
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road
Please note I don't work on Monday.