[slurm-users] slurmdbd crashes with segmentation fault following DBD_GET_ASSOCS
Dustin Lang
dstndstn at gmail.com
Tue May 5 19:39:13 UTC 2020
I tried upgrading Slurm to 18.08.9, and I am still getting this segmentation
fault!
On Tue, May 5, 2020 at 2:39 PM Dustin Lang <dstndstn at gmail.com> wrote:
> Hi,
>
> Apparently my colleague upgraded the mysql client and server, but, as far
> as I can tell, this was only 5.7.29 to 5.7.30, and checking the mysql
> release notes I don't see anything that looks suspicious there...
>
> cheers,
> --dustin
>
>
> On Tue, May 5, 2020 at 1:37 PM Dustin Lang <dstndstn at gmail.com> wrote:
>
>> Hi,
>>
>> We're running Slurm 17.11.12. Everything had been working fine, and then
>> suddenly both slurmctld and slurmdbd started crashing.
>>
>> We use fair-share as part of the queuing policy, and previously set up
>> accounts with sacctmgr; that has been working fine for months.
>>
>> If I run slurmdbd in debug mode,
>>
>> slurmdbd -D -v -v -v -v -v
>>
>> it eventually (after being contacted by slurmctld) segfaults with:
>>
>> ...
>> slurmdbd: debug2: DBD_NODE_STATE: NODE:cn049 STATE:UP REASON:(null) TIME:1588695584
>> slurmdbd: debug4: got 0 commits
>> slurmdbd: debug2: DBD_NODE_STATE: NODE:cn050 STATE:UP REASON:(null) TIME:1588695584
>> slurmdbd: debug4: got 0 commits
>> slurmdbd: debug4: got 0 commits
>> slurmdbd: debug2: DBD_GET_TRES: called
>> slurmdbd: debug4: got 0 commits
>> slurmdbd: debug2: DBD_GET_QOS: called
>> slurmdbd: debug4: got 0 commits
>> slurmdbd: debug2: DBD_GET_USERS: called
>> slurmdbd: debug4: got 0 commits
>> slurmdbd: debug2: DBD_GET_ASSOCS: called
>> slurmdbd: debug4: 10(as_mysql_assoc.c:2033) query
>> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0); select @par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos;
>> Segmentation fault (core dumped)
>>
>>
>> Running slurmdbd under gdb, the segfault appears to come from
>>
>> https://github.com/SchedMD/slurm/blob/slurm-17-11-12-1/src/plugins/accounting_storage/mysql/as_mysql_assoc.c#L2073
>>
>> If I connect to the MySQL database directly and call that stored
>> procedure, I get
>>
>> mysql> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0);
>>
>> @par_id := id_assoc: 1
>> @mj := max_jobs: NULL
>> @msj := max_submit_jobs: NULL
>> @mwpj := max_wall_pj: NULL
>> @def_qos_id := def_qos_id: NULL
>> @qos := qos: ,1,
>> @delta_qos := REPLACE(CONCAT(delta_qos, @delta_qos), ',,', ','): NULL
>> @mtpj := CONCAT(@mtpj, if (@mtpj != '' && max_tres_pj != '', ',', ''), max_tres_pj): NULL
>> @mtpn := CONCAT(@mtpn, if (@mtpn != '' && max_tres_pn != '', ',', ''), max_tres_pn): NULL
>> @mtmpj := CONCAT(@mtmpj, if (@mtmpj != '' && max_tres_mins_pj != '', ',', ''), max_tres_mins_pj): NULL
>> @mtrm := CONCAT(@mtrm, if (@mtrm != '' && max_tres_run_mins != '', ',', ''), max_tres_run_mins): NULL
>> @my_acct_new := parent_acct:
>>
>> and if I run
>>
>> mysql> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0); select @par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos;
>>
>> I get
>>
>>
>> +---------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
>> | @par_id | @mj  | @msj | @mwpj | @mtpj | @mtpn | @mtmpj | @mtrm | @def_qos_id | @qos | @delta_qos |
>> +---------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
>> | 1       | NULL | NULL | NULL  | NULL  | NULL  | NULL   | NULL  | NULL        | ,1,  | NULL       |
>> +---------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
>>
>> but I don't know what to do about this.
>>
>> We use another product ("Bright Cluster Manager") to manage some aspects
>> of the cluster and Slurm installation, so we are hesitant to just upgrade
>> Slurm.
>>
>> I would appreciate any tips.
>>
>> Thanks,
>> --dustin
>>
>>
>>
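A possible reading of the quoted output, offered as an assumption rather than a diagnosis: in the second query result, the four TRES limit variables that the stored procedure builds with CONCAT() (@mtpj, @mtpn, @mtmpj, @mtrm) all come back NULL. If the code near the linked line indexes into one of those strings without a NULL check, that could explain the crash; this has not been confirmed against the Slurm source. The Python sketch below only restates the returned row and shows what a NULL-safe check looks like. The names ROW, TRES_VARS, null_vars and first_char_or_none are made up for illustration and are not part of Slurm or MySQL.

```python
# Values copied from the second query result quoted above.
# SQL NULL is written as None.
ROW = {
    "@par_id": 1,
    "@mj": None,
    "@msj": None,
    "@mwpj": None,
    "@mtpj": None,
    "@mtpn": None,
    "@mtmpj": None,
    "@mtrm": None,
    "@def_qos_id": None,
    "@qos": ",1,",
    "@delta_qos": None,
}

# The four TRES limit strings that get_parent_limits builds with CONCAT().
TRES_VARS = ("@mtpj", "@mtpn", "@mtmpj", "@mtrm")


def null_vars(row, names):
    """Return the entries of `names` whose value in `row` is NULL (None)."""
    return [name for name in names if row[name] is None]


def first_char_or_none(value):
    """NULL-safe 'first character' lookup: None for NULL or empty strings."""
    return value[0] if value else None


if __name__ == "__main__":
    print("NULL TRES variables:", null_vars(ROW, TRES_VARS))
```

Run as a script, this lists all four TRES variables as NULL. The first_char_or_none helper stands in for the kind of guard a caller would need before reading the first character of such a field.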