[slurm-users] speed / efficiency of sacct vs. scontrol

Fri Feb 24 17:34:41 UTC 2023

Hi Sean,

Thanks again for all the feedback!

I'll definitely try to implement batch queries, then. Both for the
default `sacct` query and for the fallback `scontrol` query. Also see
here:
https://github.com/snakemake/snakemake/pull/2136#issuecomment-1443295051

Those queries then should not have to happen too often. But do you
have any indication of a range for what you mean by "you still wouldn't
want to query the status too frequently"? I don't really have one, and
would probably opt for some compromise of every 30 seconds or so.
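
To make the "every 30 seconds or so" idea concrete, here is a minimal
Python sketch of a throttle around a batch status query. The class name
`ThrottledStatusCache`, the `query` callable and the `clock` parameter
are all hypothetical names for illustration, not existing snakemake or
Slurm APIs, and the 30 second default is just the compromise value
mentioned above, not a recommendation from Slurm:

```python
import time


class ThrottledStatusCache:
    """Serve job states from a cache, refreshing at most once per interval.

    `query` is a hypothetical callable that takes a list of job IDs and
    returns a dict mapping job ID -> state (one batched call per refresh).
    """

    def __init__(self, query, min_interval=30.0, clock=time.monotonic):
        self._query = query
        self._min_interval = min_interval
        self._clock = clock  # injectable so the throttle can be tested
        self._last_query = None
        self._states = {}

    def get(self, job_ids):
        now = self._clock()
        stale = (
            self._last_query is None
            or now - self._last_query >= self._min_interval
        )
        if stale:
            # One batched query for all jobs, instead of one call per job.
            self._states = self._query(job_ids)
            self._last_query = now
        # Jobs not seen in the last refresh are reported as None.
        return {job_id: self._states.get(job_id) for job_id in job_ids}
```

However often the caller asks, at most one query per interval reaches
the scheduler. The trade-off is that a job submitted between two
refreshes shows up as `None` until the next refresh.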

One thing I didn't understand from your email is the part about job
names, as the command I gave doesn't use job names for its query:

sacct -X -P -n --format=JobIdRaw,State -j <jobid_1>,<jobid_2>,...

Instead, it just uses the JobId, and isn't that guaranteed to be unique
at any point in time? Or did you mean to say that JobId can be
non-unique? That would indeed spell trouble on a different level, and
make status checks much more complicated...
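
For illustration, a minimal Python sketch of how that one batched
`sacct` call could be built and its output parsed. The function names
are hypothetical and this is not the actual snakemake code. It assumes
the `-P` output is one `JobIdRaw|State` pair per line, and that a state
may carry a suffix (such as `CANCELLED by <uid>`), of which only the
first word is kept:

```python
import subprocess


def build_sacct_cmd(job_ids):
    """Build a single sacct call covering all given job IDs."""
    return [
        "sacct", "-X", "-P", "-n",
        "--format=JobIdRaw,State",
        "-j", ",".join(str(job_id) for job_id in job_ids),
    ]


def parse_sacct_output(text):
    """Map JobIdRaw -> State from pipe-separated sacct output."""
    states = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, _, state = line.partition("|")
        # Keep only the first word of the state field (assumption:
        # anything after it is a suffix like "by <uid>").
        words = state.split()
        states[job_id] = words[0] if words else ""
    return states


def query_job_states(job_ids):
    """Run one sacct process for the whole batch of job IDs."""
    result = subprocess.run(
        build_sacct_cmd(job_ids),
        capture_output=True, text=True, check=True,
    )
    return parse_sacct_output(result.stdout)
```

Keeping the parsing separate from the subprocess call means the parser
can be checked against sample output without a Slurm installation.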

cheers,
david

On Thu, 2023-02-23 at 11:59 -0500, Sean Maxwell wrote:
> Hi David,
> 
> On Thu, Feb 23, 2023 at 10:50 AM David Laehnemann <
> david.laehnemann at hhu.de>
> wrote:
> 
> > But from your comment I understand that handling these queries in
> > batches would be less work for slurmdbd, right? So instead of
> > querying
> > each jobid with a separate database query, it would do one database
> > query for the whole list? Is that really easier for the system, or
> > would it end up doing a call for each jobid, anyway?
> > 
> 
> From the perspective of avoiding RPC flood, it is much better to use
> a
> batch query. That said, if you have an extremely large number of jobs
> in
> the queue, you still wouldn't want to query the status too
> frequently.
> 
> 
> > And just to be as clear as possible, a call to sacct would then
> > look
> > like this:
> > sacct -X -P -n --format=JobIdRaw,State -j <jobid_1>,<jobid_2>,...
> > 
> 
> That would be one way to do it, but I think there are other
> approaches that
> might be better. For example, there is no requirement for the job
> name to
> be unique. So if the snakemake pipeline has a configurable instance
> name="foo", and snakemake was configured to specify its own name as
> the job name
> when submitting jobs (e.g. sbatch -J foo ...) then the query for all
> jobs
> in the pipeline is simply:
> 
> sacct --name=foo
> 
> > Because we can of course rewrite the respective code section, so any
> > insight on how to do this job accounting more efficiently (and
> > better
> > tailored to how Slurm does things) is appreciated.
> > 
> 
> I appreciate that you are interested in improving the integration to
> make
> it more performant. We are seeing an increase in meta-scheduler use
> at our
> site, so this is a worthwhile problem to tackle.
> 
> Thanks,
> 
> -Sean