<div dir="ltr"><div>Hi David,</div><div><br></div><div>scontrol - interacts with slurmctld using RPC, so it is faster, but requests put load on the scheduler itself.</div><div>sacct - interacts with slurmdbd, so it doesn't place additional load on the scheduler.</div><div><br></div><div>There is a balance to reach, but the scontrol approach is riskier and can start to interfere with the cluster operation if used incorrectly.</div><div><br></div><div>Best,<br></div><div><br></div><div>-Sean<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Feb 23, 2023 at 5:59 AM David Laehnemann <<a href="mailto:david.laehnemann@hhu.de">david.laehnemann@hhu.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Dear Slurm users and developers,<br>

TL;DR:
Do any of you know whether `scontrol` status checks of jobs are always
expected to be quicker than `sacct` job status checks? Do you have any
comparative timings between the two commands?
And consequently, would `scontrol` be the better default option (as
opposed to `sacct`) for repeated job status checks by a workflow
management system?

And here's the long version, with background info and links:

I have recently started using a Slurm cluster and am a regular user of
the workflow management system snakemake
(https://snakemake.readthedocs.io/en/latest/). This workflow manager
recently integrated support for running analysis workflows pretty
seamlessly on Slurm clusters. It takes care of managing all job
dependencies and handles the submission of jobs according to your
global (and job-specific) resource configurations.

One little hiccup when starting to use the snakemake-Slurm combination
was a snakemake-internal rate limit on job status checks. You can find
the full story here:
https://github.com/snakemake/snakemake/pull/2136

For debugging this, I obtained timings for `sacct` and `scontrol`, with
`scontrol` consistently about 2.5x quicker than `sacct` in returning
the job status. The timings are recorded here:
https://github.com/snakemake/snakemake/blob/b91651d5ea2314b954a3b4b096d7f327ce743b94/snakemake/scheduler.py#L199-L210

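In case it helps with reproducing such measurements on other clusters,
the comparison I mean boils down to something like the following
minimal sketch (the job ID is a placeholder and the field selections
are just examples, not the exact code behind the link above):

import subprocess
import time

# Placeholder job ID -- substitute a real job ID on your cluster.
jobid = "12345678"

def mean_runtime(cmd, repeats=10):
    """Run `cmd` `repeats` times and return the mean wall-clock time in seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        subprocess.run(cmd, capture_output=True, check=False)
    return (time.perf_counter() - start) / repeats

sacct_cmd = ["sacct", "-j", jobid, "--format=State", "--noheader", "--parsable2"]
scontrol_cmd = ["scontrol", "show", "job", jobid]

print(f"sacct:    {mean_runtime(sacct_cmd):.3f} s per call")
print(f"scontrol: {mean_runtime(scontrol_cmd):.3f} s per call")

This is only meant as a starting point for comparable measurements on
other setups, not as a benchmark in itself.
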
However, `sacct` is currently used by default for regularly checking
the status of submitted jobs, and `scontrol` is only a fallback
whenever `sacct` doesn't find the job (for example because it is not
yet running); see the sketch after my questions below for what that
check order boils down to. Now, I was wondering whether switching the
default to `scontrol` would make sense. Thus, I would like to ask:

1) Slurm users: do you also see similar timings on different Slurm
clusters, and do they confirm that `scontrol` is consistently quicker?

2) Slurm developers: is `scontrol` expected to be quicker from its
implementation, and would using `scontrol` also be the option that
puts less strain on the scheduler in general?

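For reference, the current behaviour (`sacct` by default, `scontrol`
only as fallback) is roughly equivalent to the following simplified
sketch; this is my own illustration, not the literal snakemake
implementation, and the function and variable names are mine:

import subprocess

def job_status(jobid):
    """Return the Slurm job state for `jobid`, asking sacct first and scontrol as a fallback."""
    sacct = subprocess.run(
        ["sacct", "-j", jobid, "--format=State", "--noheader", "--parsable2"],
        capture_output=True, text=True,
    )
    state = sacct.stdout.strip().splitlines()[0] if sacct.stdout.strip() else ""
    if state:
        return state
    # Fallback: sacct does not know about the job (yet), so ask slurmctld directly.
    scontrol = subprocess.run(
        ["scontrol", "show", "job", jobid],
        capture_output=True, text=True,
    )
    for token in scontrol.stdout.split():
        if token.startswith("JobState="):
            return token.split("=", 1)[1]
    return "UNKNOWN"

Switching the default would essentially just swap the order of the two
lookups in such a check.
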
Many thanks and best regards,
David