[slurm-users] speed / efficiency of sacct vs. scontrol
Brian Andrus
toomuchit at gmail.com
Mon Feb 27 14:53:52 UTC 2023
Sorry, I had to share that this is very much like "Are we there yet?" on
a road trip with kids :)
Slurm is trying to drive. Any communication with slurmctld involves an
RPC call (sinfo, squeue, scontrol, etc.). You can see how many with sdiag.
Too many RPC calls will cause failures. Asking slurmdbd will not do that
to you. In fact, you could have a separate slurmdbd just for queries if
you wanted. This is why that was suggested as a better option.
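
A minimal sketch of what "ask slurmdbd instead" could look like for a
workflow tool: one sacct call for the whole set of job IDs, rather than
one squeue/scontrol call per job. The function names are made up for
illustration, and it assumes accounting (slurmdbd) is enabled and that
sacct is on the PATH. The sacct options used (-X, --parsable2,
--noheader, --format, -j) are standard ones.

```python
import subprocess


def build_sacct_command(job_ids):
    """Build a single sacct call that covers every job ID in job_ids."""
    return [
        "sacct",
        "-X",                    # one row per job allocation, no job steps
        "--parsable2",           # '|'-delimited fields, no trailing '|'
        "--noheader",
        "--format=JobID,State",
        "-j", ",".join(str(j) for j in job_ids),
    ]


def parse_sacct_output(text):
    """Map job ID -> state from 'JobID|State' lines."""
    states = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip blank or unexpected lines
        job_id, state = line.split("|", 1)
        # States like "CANCELLED by 1000" carry a suffix; keep the first word.
        states[job_id] = state.split()[0] if state.strip() else ""
    return states


def query_job_states(job_ids):
    """Hypothetical helper: fetch the state of all jobs with one sacct call."""
    if not job_ids:
        return {}
    result = subprocess.run(
        build_sacct_command(job_ids),
        capture_output=True, text=True, check=True,
    )
    return parse_sacct_output(result.stdout)
```

Called as query_job_states([101, 102, 103]), this costs one accounting
query per polling round no matter how many jobs are being tracked, and
none of it goes to slurmctld.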
So, even if you run 'squeue' once every few seconds, it would impact the
system. More so depending on the size of the system. We have had that
issue with users running 'watch squeue' and had to address it.
IMHO, the true solution is that if a job's info NEEDS to be updated that
often, have the job itself report what it is doing (but NOT via slurm
commands). There are numerous ways to do that for most jobs.
Perhaps there are some additional lines that could be added to the job
that would call a snakemake API and have the job report itself? Or maybe
such an API could be created/expanded.
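
As a rough illustration of a job reporting on itself without any slurm
commands, here is a sketch that writes a small status file to a
directory both the job and the workflow tool can see. Everything here
is an assumption for the sake of the example: the function names, the
JSON layout, and the shared directory are hypothetical, and this is not
an existing snakemake API.

```python
import json
import os
import tempfile
import time


def report_status(status_dir, job_name, state, detail=""):
    """Called from inside the job: atomically write <job_name>.json."""
    os.makedirs(status_dir, exist_ok=True)
    record = {"job": job_name, "state": state,
              "detail": detail, "time": time.time()}
    # Write to a temp file first so readers never see a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=status_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(record, handle)
    final_path = os.path.join(status_dir, job_name + ".json")
    os.replace(tmp_path, final_path)  # atomic rename on the same filesystem
    return final_path


def read_statuses(status_dir):
    """Called by the workflow tool: read all statuses, no slurmctld RPCs."""
    statuses = {}
    for name in os.listdir(status_dir):
        if name.endswith(".json"):
            with open(os.path.join(status_dir, name)) as handle:
                record = json.load(handle)
            statuses[record["job"]] = record["state"]
    return statuses
```

The job would call report_status(...) at start, at any milestones, and
at exit; the workflow tool can then poll the directory as often as it
likes, since that load lands on the filesystem and not on the scheduler.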
Just a quick 2 cents (We may be up to a few dollars with all of those so
far).
Brian Andrus
On 2/27/2023 4:24 AM, Ward Poelmans wrote:
> On 24/02/2023 18:34, David Laehnemann wrote:
>> Those queries then should not have to happen too often, although do you
>> have any indication of a range for when you say "you still wouldn't
>> want to query the status too frequently"? Because I don't really, and
>> would probably opt for some compromise of every 30 seconds or so.
>
> I think this is exactly why hpc sys admins are sometimes not very
> happy about these tools. You're talking about 10000s of jobs on one
> hand yet you want to fetch the status every 30 seconds? What is the
> point of that other than overloading the scheduler?
>
> We're telling our users not to query Slurm too often and usually
> give 5 minutes as a good interval. You have to let Slurm do its job.
> There is no point in querying in a loop every 30 seconds when we're
> talking about large numbers of jobs.
>
>
> Ward