[slurm-users] slow sacct queries after upgrading to 17.11.0

Pablo Escobar pescobar001 at gmail.com
Mon Dec 18 15:36:37 MST 2017


Hi,

We upgraded from 17.02.3 to 17.11.0, and since the upgrade a simple
"sacct -j $jobid" takes much longer than before. Before the upgrade sacct
returned almost immediately; now it takes around 1 minute.

After enabling the slow query log in MariaDB, we found the following slow
query, which is triggered by "sacct -j $jobid":


# Time: 171218 23:19:59
# User@Host: slurm[slurm] @ localhost []
# Thread_id: 17  Schema: slurm  QC_hit: No
# Query_time: 54.930359  Lock_time: 0.000171  Rows_sent: 1  Rows_examined: 19289595
SET timestamp=1513635599;
select t1.account, t1.admin_comment, t1.array_max_tasks, t1.array_task_str,
t1.cpus_req, t1.derived_ec, t1.derived_es, t1.exit_code, t1.id_array_job,
t1.id_array_task, t1.id_assoc, t1.id_block, t1.id_group, t1.id_job,
t1.pack_job_id, t1.pack_job_offset, t1.id_qos, t1.id_resv, t3.resv_name,
t1.id_user, t1.id_wckey, t1.job_db_inx, t1.job_name, t1.kill_requid,
t1.mem_req, t1.node_inx, t1.nodelist, t1.nodes_alloc, t1.partition,
t1.priority, t1.state, t1.time_eligible, t1.time_end, t1.time_start,
t1.time_submit, t1.time_suspended, t1.timelimit, t1.track_steps, t1.wckey,
t1.gres_alloc, t1.gres_req, t1.gres_used, t1.tres_alloc, t1.tres_req,
t1.work_dir, t1.mcs_label, t2.acct, t2.lft, t2.user
from "scicore_job_table" as t1
left join "scicore_assoc_table" as t2 on t1.id_assoc=t2.id_assoc
left join "scicore_resv_table" as t3 on t1.id_resv=t3.id_resv
  && ((t1.time_start && (t3.time_start < t1.time_start
        && (t3.time_end >= t1.time_start || t3.time_end = 0)))
      || ((t3.time_start < t1.time_submit
           && (t3.time_end >= t1.time_submit || t3.time_end = 0))
          || (t3.time_start > t1.time_submit)))
where (t1.id_job in (12000000) || t1.pack_job_id in (12000000)
       || (t1.id_array_job in (12000000)))
  && (state != 524288)
group by id_job, time_submit desc;
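
The execution plan for the filtering part of this query could be inspected
with EXPLAIN. The following is only a sketch: it is a reduced form of the
logged query that keeps the job table and the WHERE clause and drops the two
joins, on the assumption that the joins are not the expensive part. The
identifiers are written without double quotes so that the statement also works
in a client session where ANSI_QUOTES is not set.

```sql
-- Reduced form of the logged query (joins omitted, an assumption made
-- for brevity): ask MariaDB how it would locate the matching job rows.
EXPLAIN
SELECT t1.id_job, t1.time_submit
FROM scicore_job_table AS t1
WHERE (t1.id_job IN (12000000)
       OR t1.pack_job_id IN (12000000)
       OR t1.id_array_job IN (12000000))
  AND t1.state != 524288;
```

In the EXPLAIN output, a "type" of ALL together with a "rows" estimate close
to the size of the table would indicate a full table scan.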


We have many jobs in our database, but before the upgrade this was not a
problem when querying a single job ID. slurmdbd runs on the same machine as
MariaDB. Here are some numbers from our database:

MariaDB [slurm]> SELECT COUNT(*) FROM scicore_job_table;
+----------+
| COUNT(*) |
+----------+
| 19289598 |
+----------+

MariaDB [slurm]> SELECT COUNT(*) FROM scicore_step_table;
+----------+
| COUNT(*) |
+----------+
| 26635716 |
+----------+
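
The row count of the job table (19289598) is almost identical to the
Rows_examined value in the slow query log (19289595), which suggests that the
query reads nearly the whole job table rather than using an index. A minimal
sketch of how the indexes on that table could be listed. Which indexes
actually exist after the schema conversion to 17.11.0 is not known here, so no
output is shown.

```sql
-- List the indexes defined on the job table. The Column_name field of the
-- result shows whether id_job, pack_job_id and id_array_job, the three
-- columns used in the WHERE clause of the slow query, are each indexed.
SHOW INDEX FROM scicore_job_table;
```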


Any suggestions on how to solve this problem? Thanks in advance for any
help.

regards,
Pablo.