[slurm-users] Slurm does not start after (stupid) upgrade from 16.05.9 to 20.11.7
Julien Tailleur
julien.tailleur at gmail.com
Wed Aug 25 08:48:53 UTC 2021
Dear all,
We have been running a computing cluster with Slurm since 2016, which I
installed back then with some help from others. I had fallen behind on
upgrades and decided to upgrade the cluster from Debian Stretch, which
runs Slurm 16.05.9, to Debian Bullseye, which runs Slurm 20.11.7.
While the system upgrade itself went smoothly, Slurm is now broken.
Of course, that is the stage at which I thought "Oh, I should have
checked whether the upgrade is supposed to be harmless"... Now that the
self-bashing is rightfully done, I would be very happy with some help! I
am hesitating between two strategies: removing Slurm completely and
doing a fresh installation, or trying to save what can be saved... I am
tempted by the former since I remember suffering a bit to get the
installation right in the first place...
Munge still works fine, but when I run
slurmctld -Dvvvvv -c
everything goes smoothly until:
[...]
slurmctld: accounting_storage/slurmdbd: init: Accounting storage
SLURMDBD plugin loaded
slurmctld: debug3: Success.
slurmctld: debug2: slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at
127.0.1.1:6819: Connection refused
slurmctld: error: slurm_persist_conn_open_without_init: failed to open
persistent connection to host:kandinsky:6819: Connection refused
slurmctld: error: Sending PersistInit msg: Connection refused
slurmctld: accounting_storage/slurmdbd: _load_dbd_state: recovered 0
pending RPCs
slurmctld: accounting_storage/slurmdbd:
clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817
with slurmdbd
slurmctld: debug2: slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at
127.0.1.1:6819: Connection refused
slurmctld: error: Sending PersistInit msg: Connection refused
slurmctld: debug: Association database appears down, reading from state
file.
slurmctld: debug: create_mmap_buf: Failed to open file
`/var/spool/slurm.state/last_tres`, No such file or directory
slurmctld: debug2: No last_tres file (/var/spool/slurm.state/last_tres)
to recover
slurmctld: debug: create_mmap_buf: Failed to open file
`/var/spool/slurm.state/assoc_mgr_state`, No such file or directory
slurmctld: debug2: No association state file
(/var/spool/slurm.state/assoc_mgr_state) to recover
slurmctld: fatal: You are running with a database but for some reason we
have no TRES from it. This should only happen if the database is down
and you don't have any state files.
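The "Connection refused" lines above suggest that nothing is accepting
connections on 127.0.1.1:6819. A minimal sketch to confirm this
independently of Slurm, assuming Python 3 is available on the head node
(the helper name port_is_open is hypothetical, not part of any Slurm
tool):

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening or the host is unreachable.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling port_is_open("127.0.1.1", 6819) with the address taken from the
log should return False as long as no daemon is listening there.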
6819 is the port on which slurmdbd is supposed to be running, so I tried:
slurmdbd -Dvvvvv
which yields
slurmdbd: debug: Log file re-opened
slurmdbd: pidfile not locked, assuming no running daemon
slurmdbd: debug3: Trying to load plugin
/usr/lib/x86_64-linux-gnu/slurm-wlm/auth_munge.so
slurmdbd: debug: auth/munge: init: Munge authentication plugin loaded
slurmdbd: debug3: Success.
slurmdbd: debug3: Trying to load plugin
/usr/lib/x86_64-linux-gnu/slurm-wlm/accounting_storage_mysql.so
slurmdbd: debug2: accounting_storage/as_mysql: init: mysql_connect()
called for db slurm_db
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: accounting_storage/as_mysql: _check_mysql_concat_is_sane:
MySQL server version is: 10.5.11-MariaDB-1
slurmdbd: debug2: accounting_storage/as_mysql:
_check_database_variables: innodb_buffer_pool_size: 134217728
slurmdbd: debug2: accounting_storage/as_mysql:
_check_database_variables: innodb_log_file_size: 100663296
slurmdbd: debug2: accounting_storage/as_mysql:
_check_database_variables: innodb_lock_wait_timeout: 50
slurmdbd: error: Database settings not recommended values:
innodb_buffer_pool_size innodb_lock_wait_timeout
slurmdbd: debug4: accounting_storage/as_mysql: _set_db_curr_ver:
0(as_mysql_convert.c:128) query
select version from convert_version_table
slurmdbd: debug4: accounting_storage/as_mysql:
as_mysql_convert_tables_pre_create: as_mysql_convert_tables_pre_create:
No conversion needed, Horray!
slurmdbd: debug4: accounting_storage/as_mysql:
as_mysql_convert_tables_post_create:
as_mysql_convert_tables_post_create: No conversion needed, Horray!
slurmdbd: debug4: accounting_storage/as_mysql:
as_mysql_convert_non_cluster_tables_post_create:
as_mysql_convert_non_cluster_tables_post_create: No conversion needed,
Horray!
slurmdbd: error: mysql_query failed: 1558 Column count of mysql.proc is
wrong. Expected 21, found 20. Created with MariaDB 100126, now running
100511. Please use mariadb-upgrade to fix this error
drop procedure if exists get_parent_limits; create procedure
get_parent_limits(my_table text, acct text, cluster text, without_limits
int) begin set @par_id = NULL; set @mj = NULL; set @mja = NULL; set @mpt
= NULL; set @msj = NULL; set @mwpj = NULL; set @mtpj = ''; set @mtpn =
''; set @mtmpj = ''; set @mtrm = ''; set @prio = NULL; set @def_qos_id =
NULL; set @qos = ''; set @delta_qos = ''; set @my_acct = acct; if
without_limits then set @mj = 0; set @msj = 0; set @mwpj = 0; set @prio
= 0; set @def_qos_id = 0; set @qos = 1; end if; REPEAT set @s = 'select
'; if @par_id is NULL then set @s = CONCAT(@s, '@par_id := id_assoc, ');
end if; if @mj is NULL then set @s = CONCAT(@s, '@mj := max_jobs, ');
end if; if @mja is NULL then set @s = CONCAT(@s, '@mja :=
max_jobs_accrue, '); end if; if @mpt is NULL then set @s = CONCAT(@s,
'@mpt := min_prio_thresh, '); end if; if @msj is NULL then set @s =
CONCAT(@s, '@msj := max_submit_jobs, '); end if; if @mwpj is NULL then
set @s = CONCAT(@s, '@mwpj := max_wall_pj, '); end if; if @prio is NULL
then set @s = CONCAT(@s, '@prio := priority, '); end if; if @def_qos_id
is NULL then set @s = CONCAT(@s, '@def_qos_id := def_qos_id, '); end if;
if @qos = '' then set @s = CONCAT(@s, '@qos := qos, @delta_qos :=
REPLACE(CONCAT(delta_qos, @delta_qos), \',,\', \',\'), '); end if; set
@s = concat(@s, '@mtpj := CONCAT(@mtpj, if (@mtpj != \'\' && max_tres_pj
!= \'\', \',\', \'\'), max_tres_pj), @mtpn := CONCAT(@mtpn, if (@mtpn !=
\'\' && max_tres_pn != \'\', \',\', \'\'), max_tres_pn), @mtmpj :=
CONCAT(@mtmpj, if (@mtmpj != \'\' && max_tres_mins_pj != \'\', \',\',
\'\'), max_tres_mins_pj), @mtrm := CONCAT(@mtrm, if (@mtrm != \'\' &&
max_tres_run_mins != \'\', \',\', \'\'), max_tres_run_mins),
@my_acct_new := parent_acct from "', cluster, '_', my_table, '" where
acct = \'', @my_acct, '\' && user=\'\''); prepare query from @s; execute
query; deallocate prepare query; set @my_acct = @my_acct_new; UNTIL
without_limits || @my_acct = '' END REPEAT; END;
slurmdbd: error: mysql_query failed: 1558 Column count of mysql.proc is
wrong. Expected 21, found 20. Created with MariaDB 100126, now running
100511. Please use mariadb-upgrade to fix this error
drop procedure if exists get_coord_qos; create procedure
get_coord_qos(my_table text, acct text, cluster text, coord text) begin
set @qos = ''; set @delta_qos = ''; set @found_coord = NULL; set
@my_acct = acct; REPEAT set @s = 'select @qos := t1.qos, @delta_qos :=
REPLACE(CONCAT(t1.delta_qos, @delta_qos), \',,\', \',\'), @my_acct_new
:= parent_acct, @found_coord_curr := t2.user '; set @s = concat(@s,
'from "', cluster, '_', my_table, '" as t1 left outer join
acct_coord_table as t2 on t1.acct=t2.acct where t1.acct = @my_acct &&
t1.user=\'\' && (t2.user=\'', coord, '\' || t2.user is null)'); prepare
query from @s; execute query; deallocate prepare query; if
@found_coord_curr is not NULL then set @found_coord = @found_coord_curr;
end if; if @found_coord is NULL then set @qos = ''; set @delta_qos = '';
end if; set @my_acct = @my_acct_new; UNTIL @qos != '' || @my_acct = ''
END REPEAT; select REPLACE(CONCAT(@qos, @delta_qos), ',,', ','); END;
slurmdbd: accounting_storage/as_mysql: init: Accounting storage MYSQL
plugin failed
slurmdbd: error: Couldn't load specified plugin name for
accounting_storage/mysql: Plugin init() callback failed
slurmdbd: error: cannot create accounting_storage context for
accounting_storage/mysql
slurmdbd: fatal: Unable to initialize accounting_storage/mysql
accounting storage plugin
It thus seems that the database format is wrong. I do not care about
previous logs, so I would be happy to erase the previous tables and
create new ones, if possible, but I do not know what to do :-)
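For reference, the two numbers in error 1558 are MariaDB version ids
packed as major*10000 + minor*100 + patch. A small sketch that decodes
them from the message text (the function names are hypothetical, and the
regular expression only assumes the wording shown in the log above):

```python
import re


def decode_mariadb_version(version_id: int) -> str:
    """Turn a packed version id such as 100511 into '10.5.11'."""
    major, rest = divmod(version_id, 10000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"


def versions_from_error(message: str):
    """Extract (created_with, now_running) from an error 1558 message.

    Returns None if the message does not contain the expected wording.
    \\s+ tolerates the line wrap that the mail client inserted.
    """
    match = re.search(
        r"Created with MariaDB\s+(\d+),\s+now running\s+(\d+)", message
    )
    if match is None:
        return None
    return tuple(decode_mariadb_version(int(g)) for g in match.groups())
```

Applied to the message above, this gives 10.1.26 for the version that
created the system tables and 10.5.11 for the running server, which
matches the "MySQL server version is: 10.5.11-MariaDB-1" line.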
I tried running
mariadb-upgrade
but got
Version check failed. Got the following error when calling the 'mysql'
command line client
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using
password: NO)
FATAL ERROR: Upgrade failed
I have to admit that I do not remember setting a root password, but it
was a long time ago and I was not the only one messing with the
cluster... I tried to follow this guide to reset the root password:
https://linuxize.com/post/how-to-reset-a-mysql-root-password/
but it does not seem to work. I would be happy with some
suggestions!
Best,
Julien Tailleur