[slurm-users] Slurm does not start after (stupid) upgrade from 16.05.9 to 20.11.7

Julien Tailleur julien.tailleur at gmail.com
Wed Aug 25 08:48:53 UTC 2021


Dear all,

We have been running a computing cluster with Slurm since 2016; I
installed it back then, with some help from others. I was quite late on
upgrades and decided to upgrade the cluster from Debian Stretch, which
ships Slurm 16.05.9, to Debian Bullseye, which ships Slurm 20.11.7.

While the system upgrade itself went smoothly, Slurm is now broken.
Of course, that's the stage at which I thought "Oh, I should have
checked whether the upgrade is supposed to be harmless"... Now that the
self-bashing is rightfully done, I would be very happy with some help! I
am hesitating between two strategies: removing Slurm completely and
doing a fresh installation, or trying to save what can be saved... I am
tempted by the former since I remember suffering a bit to get the
installation right in the first place...

Munge still works fine, but when I run

slurmctld -Dvvvvv -c

everything goes smoothly until:

[...]
slurmctld: accounting_storage/slurmdbd: init: Accounting storage 
SLURMDBD plugin loaded
slurmctld: debug3: Success.
slurmctld: debug2: slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 
127.0.1.1:6819: Connection refused
slurmctld: error: slurm_persist_conn_open_without_init: failed to open 
persistent connection to host:kandinsky:6819: Connection refused
slurmctld: error: Sending PersistInit msg: Connection refused
slurmctld: accounting_storage/slurmdbd: _load_dbd_state: recovered 0 
pending RPCs
slurmctld: accounting_storage/slurmdbd: 
clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 
with slurmdbd
slurmctld: debug2: slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 
127.0.1.1:6819: Connection refused
slurmctld: error: Sending PersistInit msg: Connection refused
slurmctld: debug:  Association database appears down, reading from state 
file.
slurmctld: debug:  create_mmap_buf: Failed to open file 
`/var/spool/slurm.state/last_tres`, No such file or directory
slurmctld: debug2: No last_tres file (/var/spool/slurm.state/last_tres) 
to recover
slurmctld: debug:  create_mmap_buf: Failed to open file 
`/var/spool/slurm.state/assoc_mgr_state`, No such file or directory
slurmctld: debug2: No association state file 
(/var/spool/slurm.state/assoc_mgr_state) to recover
slurmctld: fatal: You are running with a database but for some reason we 
have no TRES from it.  This should only happen if the database is down 
and you don't have any state files.
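
The repeated "Connection refused" at 127.0.1.1:6819 in the log above can
be double-checked independently of Slurm. A minimal sketch, assuming
bash (for its /dev/tcp pseudo-device) and the coreutils timeout command
are available; the function name port_open is made up for illustration:

```shell
#!/usr/bin/env bash
# Return 0 if something accepts TCP connections on host:port,
# non-zero otherwise (refused, unreachable, or no answer within 2 s).
# Assumes bash's /dev/tcp support and coreutils' timeout.
port_open() {
    timeout 2 bash -c ': < "/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Host and port taken from the slurmctld log above.
if port_open 127.0.1.1 6819; then
    echo "port 6819: something is listening"
else
    echo "port 6819: nothing is listening"
fi
```

If nothing is listening, the problem lies with the daemon behind that
port rather than with slurmctld itself.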

6819 is the port on which slurmdbd is supposed to be listening, so I tried:

slurmdbd -Dvvvvv

which yields

slurmdbd: debug:  Log file re-opened
slurmdbd: pidfile not locked, assuming no running daemon
slurmdbd: debug3: Trying to load plugin 
/usr/lib/x86_64-linux-gnu/slurm-wlm/auth_munge.so
slurmdbd: debug:  auth/munge: init: Munge authentication plugin loaded
slurmdbd: debug3: Success.
slurmdbd: debug3: Trying to load plugin 
/usr/lib/x86_64-linux-gnu/slurm-wlm/accounting_storage_mysql.so
slurmdbd: debug2: accounting_storage/as_mysql: init: mysql_connect() 
called for db slurm_db
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: accounting_storage/as_mysql: _check_mysql_concat_is_sane: 
MySQL server version is: 10.5.11-MariaDB-1
slurmdbd: debug2: accounting_storage/as_mysql: 
_check_database_variables: innodb_buffer_pool_size: 134217728
slurmdbd: debug2: accounting_storage/as_mysql: 
_check_database_variables: innodb_log_file_size: 100663296
slurmdbd: debug2: accounting_storage/as_mysql: 
_check_database_variables: innodb_lock_wait_timeout: 50
slurmdbd: error: Database settings not recommended values: 
innodb_buffer_pool_size innodb_lock_wait_timeout
slurmdbd: debug4: accounting_storage/as_mysql: _set_db_curr_ver: 
0(as_mysql_convert.c:128) query
select version from convert_version_table
slurmdbd: debug4: accounting_storage/as_mysql: 
as_mysql_convert_tables_pre_create: as_mysql_convert_tables_pre_create: 
No conversion needed, Horray!
slurmdbd: debug4: accounting_storage/as_mysql: 
as_mysql_convert_tables_post_create: 
as_mysql_convert_tables_post_create: No conversion needed, Horray!
slurmdbd: debug4: accounting_storage/as_mysql: 
as_mysql_convert_non_cluster_tables_post_create: 
as_mysql_convert_non_cluster_tables_post_create: No conversion needed, 
Horray!
slurmdbd: error: mysql_query failed: 1558 Column count of mysql.proc is 
wrong. Expected 21, found 20. Created with MariaDB 100126, now running 
100511. Please use mariadb-upgrade to fix this error
drop procedure if exists get_parent_limits; create procedure 
get_parent_limits(my_table text, acct text, cluster text, without_limits 
int) begin set @par_id = NULL; set @mj = NULL; set @mja = NULL; set @mpt 
= NULL; set @msj = NULL; set @mwpj = NULL; set @mtpj = ''; set @mtpn = 
''; set @mtmpj = ''; set @mtrm = ''; set @prio = NULL; set @def_qos_id = 
NULL; set @qos = ''; set @delta_qos = ''; set @my_acct = acct; if 
without_limits then set @mj = 0; set @msj = 0; set @mwpj = 0; set @prio 
= 0; set @def_qos_id = 0; set @qos = 1; end if; REPEAT set @s = 'select 
'; if @par_id is NULL then set @s = CONCAT(@s, '@par_id := id_assoc, '); 
end if; if @mj is NULL then set @s = CONCAT(@s, '@mj := max_jobs, '); 
end if; if @mja is NULL then set @s = CONCAT(@s, '@mja := 
max_jobs_accrue, '); end if; if @mpt is NULL then set @s = CONCAT(@s, 
'@mpt := min_prio_thresh, '); end if; if @msj is NULL then set @s = 
CONCAT(@s, '@msj := max_submit_jobs, '); end if; if @mwpj is NULL then 
set @s = CONCAT(@s, '@mwpj := max_wall_pj, '); end if; if @prio is NULL 
then set @s = CONCAT(@s, '@prio := priority, '); end if; if @def_qos_id 
is NULL then set @s = CONCAT(@s, '@def_qos_id := def_qos_id, '); end if; 
if @qos = '' then set @s = CONCAT(@s, '@qos := qos, @delta_qos := 
REPLACE(CONCAT(delta_qos, @delta_qos), \',,\', \',\'), '); end if; set 
@s = concat(@s, '@mtpj := CONCAT(@mtpj, if (@mtpj != \'\' && max_tres_pj 
!= \'\', \',\', \'\'), max_tres_pj), @mtpn := CONCAT(@mtpn, if (@mtpn != 
\'\' && max_tres_pn != \'\', \',\', \'\'), max_tres_pn), @mtmpj := 
CONCAT(@mtmpj, if (@mtmpj != \'\' && max_tres_mins_pj != \'\', \',\', 
\'\'), max_tres_mins_pj), @mtrm := CONCAT(@mtrm, if (@mtrm != \'\' && 
max_tres_run_mins != \'\', \',\', \'\'), max_tres_run_mins), 
@my_acct_new := parent_acct from "', cluster, '_', my_table, '" where 
acct = \'', @my_acct, '\' && user=\'\''); prepare query from @s; execute 
query; deallocate prepare query; set @my_acct = @my_acct_new; UNTIL 
without_limits || @my_acct = '' END REPEAT; END;
slurmdbd: error: mysql_query failed: 1558 Column count of mysql.proc is 
wrong. Expected 21, found 20. Created with MariaDB 100126, now running 
100511. Please use mariadb-upgrade to fix this error
drop procedure if exists get_coord_qos; create procedure 
get_coord_qos(my_table text, acct text, cluster text, coord text) begin 
set @qos = ''; set @delta_qos = ''; set @found_coord = NULL; set 
@my_acct = acct; REPEAT set @s = 'select @qos := t1.qos, @delta_qos := 
REPLACE(CONCAT(t1.delta_qos, @delta_qos), \',,\', \',\'), @my_acct_new 
:= parent_acct, @found_coord_curr := t2.user '; set @s = concat(@s, 
'from "', cluster, '_', my_table, '" as t1 left outer join 
acct_coord_table as t2 on t1.acct=t2.acct where t1.acct = @my_acct && 
t1.user=\'\' && (t2.user=\'', coord, '\' || t2.user is null)'); prepare 
query from @s; execute query; deallocate prepare query; if 
@found_coord_curr is not NULL then set @found_coord = @found_coord_curr; 
end if; if @found_coord is NULL then set @qos = ''; set @delta_qos = ''; 
end if; set @my_acct = @my_acct_new; UNTIL @qos != '' || @my_acct = '' 
END REPEAT; select REPLACE(CONCAT(@qos, @delta_qos), ',,', ','); END;
slurmdbd: accounting_storage/as_mysql: init: Accounting storage MYSQL 
plugin failed
slurmdbd: error: Couldn't load specified plugin name for 
accounting_storage/mysql: Plugin init() callback failed
slurmdbd: error: cannot create accounting_storage context for 
accounting_storage/mysql
slurmdbd: fatal: Unable to initialize accounting_storage/mysql 
accounting storage plugin
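
The two numbers in the 1558 error are numeric version ids. A minimal
sketch decoding them, assuming the conventional
major*10000 + minor*100 + patch encoding (the function name
decode_mariadb_version is made up for illustration):

```shell
# Decode a numeric MariaDB version id into major.minor.patch,
# assuming the id is major*10000 + minor*100 + patch.
decode_mariadb_version() {
    echo "$(( $1 / 10000 )).$(( $1 / 100 % 100 )).$(( $1 % 100 ))"
}

decode_mariadb_version 100126   # version that created mysql.proc
decode_mariadb_version 100511   # version now running
```

Under that assumption, 100511 decodes to 10.5.11, which matches the
server version reported earlier in the slurmdbd log, and 100126 decodes
to 10.1.26: the system tables were created by the older MariaDB release
and have not been upgraded since.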

It thus seems that the database format is wrong. I do not care about
previous logs, so I would be happy to erase the previous tables and
create new ones, if possible, but I do not know how to do that :-)

I tried running

mariadb-upgrade

but got

Version check failed. Got the following error when calling the 'mysql' 
command line client
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using 
password: NO)
FATAL ERROR: Upgrade failed

I have to admit that I do not remember setting a root password, but this
dates back a while and I was not the only one messing with the
cluster... I tried to follow this guide to reset the root password:

https://linuxize.com/post/how-to-reset-a-mysql-root-password/

but this does not seem to work. I would be happy with some
suggestions!

Best,

Julien Tailleur






