[slurm-users] speed / efficiency of sacct vs. scontrol
Bas van der Vlies
bas.vandervlies at surf.nl
Mon Feb 27 16:38:01 UTC 2023
We have many JupyterHub jobs on our cluster, and they also do a lot of job
queries. We could adjust the query interval, but what I did instead is have
one process query all the jobs with `squeue --json`; the JupyterHub query
script then looks up its jobs in that output.
So instead of every JupyterHub job querying the batch system, there is only
one process doing so. This is specific to the hub environment, but if a lot
of users run snakemake you hit the same problem.
As an admin I can understand the queries, and it is not only snakemake; there
are plenty of other tools, like the hub, that also do a lot of queries. Some
kind of caching mechanism would be nice. Most solve it with a wrapper script
(see the sketch below).
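For illustration, here is a minimal sketch in Python of what such a caching
wrapper could look like. The cache path, the TTL, and the exact JSON field
names ("jobs", "job_id", "job_state") are assumptions on my side and depend
on your Slurm version; this is just to show the idea of refreshing once and
reading many times, not the script we actually run.

#!/usr/bin/env python3
"""Sketch of a caching squeue wrapper: one refresh, many cheap lookups."""
import json
import os
import subprocess
import time

CACHE_FILE = "/var/cache/slurm/squeue.json"   # hypothetical location
CACHE_TTL = 60                                # refresh at most once per minute

def refresh_cache_if_stale():
    """Run `squeue --json` only when the cached copy is older than CACHE_TTL."""
    try:
        if time.time() - os.path.getmtime(CACHE_FILE) < CACHE_TTL:
            return
    except FileNotFoundError:
        pass
    output = subprocess.run(
        ["squeue", "--json"], check=True, capture_output=True, text=True
    ).stdout
    tmp = CACHE_FILE + ".tmp"
    with open(tmp, "w") as fh:
        fh.write(output)
    os.replace(tmp, CACHE_FILE)  # atomic swap so readers never see a partial file

def job_state(job_id):
    """Look up one job in the cached output instead of calling squeue again."""
    with open(CACHE_FILE) as fh:
        data = json.load(fh)
    for job in data.get("jobs", []):
        if job.get("job_id") == job_id:
            # "job_state" may be a string or a list depending on Slurm version
            return job.get("job_state")
    return None

if __name__ == "__main__":
    refresh_cache_if_stale()
    print(job_state(12345))

With something like this, only the refresh touches slurmctld; every hub or
workflow-tool status check just reads the cached file.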
Just my 2 cents
On 27/02/2023 15:53, Brian Andrus wrote:
> Sorry, I had to share that this is very much like "Are we there yet?" on
> a road trip with kids :)
>
> Slurm is trying to drive. Any communication to slurmctld will involve an
> RPC call (sinfo, squeue, scontrol, etc). You can see how many with sinfo.
> Too many RPC calls will cause failures. Asking slurmdbd will not do that
> to you. In fact, you could have a separate slurmdbd just for queries if
> you wanted. This is why that was suggested as a better option.
>
> So, even if you run 'squeue' once every few seconds, it would impact the
> system. More so depending on the size of the system. We have had that
> issue with users running 'watch squeue' and had to address it.
>
> IMHO, the true solution is that if a job's info NEEDS to be updated that
> often, have the job itself report what it is doing (but NOT via slurm
> commands). There are numerous ways to do that for most jobs.
>
> Perhaps there are some additional lines that could be added to the job
> that would call a snakemake API and report its status? Or maybe such
> an API could be created/expanded.
>
> Just a quick 2 cents (We may be up to a few dollars with all of those so
> far).
>
> Brian Andrus
>
>
> On 2/27/2023 4:24 AM, Ward Poelmans wrote:
>> On 24/02/2023 18:34, David Laehnemann wrote:
>>> Those queries then should not have to happen too often, although do you
>>> have any indication of a range for when you say "you still wouldn't
>>> want to query the status too frequently"? Because I don't really, and
>>> would probably opt for some compromise of every 30 seconds or so.
>>
>> I think this is exactly why HPC sysadmins are sometimes not very
>> happy about these tools. You're talking about tens of thousands of jobs
>> on one hand, yet you want to fetch the status every 30 seconds? What is
>> the point of that other than overloading the scheduler?
>>
>> We tell our users not to query Slurm too often and usually
>> give 5 minutes as a good interval. You have to let Slurm do its job.
>> There is no point in querying in a loop every 30 seconds when we're
>> talking about large numbers of jobs.
>>
>>
>> Ward
>
--
Bas van der Vlies
| High Performance Computing & Visualization | SURF| Science Park 140 |
1098 XG Amsterdam
| T +31 (0) 20 800 1300 | bas.vandervlies at surf.nl | www.surf.nl |