[slurm-users] [EXT] Weird issues with slurm's Priority
zaxs84
sciuscianebbia at gmail.com
Tue Jul 7 14:47:50 UTC 2020
Hi Sean,
thank you very much for your reply.
> If a lower priority job can start AND finish before the resources a
> higher priority job requires are available, the backfill scheduler will
> start the lower priority job.
That's very interesting, but how can the scheduler predict how long a
low-priority job will take?
> In your example job list, can you also list the requested times for each
> job? That will show if it is the backfill scheduler doing what it is
> designed to do.
You mean the wall clock time? If that's the case, we don't usually set that.
Thanks again
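
A minimal sketch of the backfill rule quoted above, for readers following the thread. It is a toy model, not Slurm's implementation: the names PendingJob, can_backfill, free_cpus and minutes_until_high_prio_start are hypothetical, and it assumes the scheduler's only estimate of a job's run time is its requested wall clock limit.

```python
from dataclasses import dataclass


@dataclass
class PendingJob:
    """Hypothetical stand-in for a queued job (not a Slurm data structure)."""
    job_id: int
    cpus: int
    time_limit_min: float  # requested wall clock limit, in minutes


def can_backfill(job: PendingJob, free_cpus: int,
                 minutes_until_high_prio_start: float) -> bool:
    """Toy version of the rule: the job must fit in the idle CPUs AND its
    time limit must expire before the higher priority job is expected to start."""
    fits_now = job.cpus <= free_cpus
    ends_in_time = job.time_limit_min <= minutes_until_high_prio_start
    return fits_now and ends_in_time


# Assumed scenario: 4 idle CPUs, high priority job expected to start in 60 minutes.
short_job = PendingJob(job_id=1, cpus=1, time_limit_min=30)
no_limit_job = PendingJob(job_id=2, cpus=1, time_limit_min=float("inf"))
wide_job = PendingJob(job_id=3, cpus=8, time_limit_min=30)

for j in (short_job, no_limit_job, wide_job):
    print(j.job_id, can_backfill(j, free_cpus=4, minutes_until_high_prio_start=60))
```

In this model a job with no finite time limit can never be shown to finish in time, which is why the requested times matter for deciding whether backfill explains the behaviour.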
On Tue, Jul 7, 2020 at 11:39 AM Sean Crosby <scrosby at unimelb.edu.au> wrote:
> Hi,
>
> What you have described is how the backfill scheduler works. If a lower
> priority job can start AND finish before the resources a higher priority
> job requires are available, the backfill scheduler will start the lower
> priority job.
>
> Your high priority job requires 24 cores, whereas the lower priority jobs
> only require 1 core each. Therefore there might be some free resources the
> lower priority jobs can use that the 24 core job can't. The backfill
> scheduler can make the lower priority jobs take advantage of those free
> cores, but only if doing so does not delay the higher priority job's
> expected start time.
>
> In your example job list, can you also list the requested times for each
> job? That will show if it is the backfill scheduler doing what it is
> designed to do.
>
> Sean
>
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
>
> On Tue, 7 Jul 2020 at 19:05, zaxs84 <sciuscianebbia at gmail.com> wrote:
>
>> Hi all.
>>
>> We want to achieve a simple thing with slurm: launch "normal" jobs, and
>> be able to launch "high priority" jobs that run as soon as possible. End of
>> it. However we cannot achieve this in a reliable way, meaning that our
>> current config sometimes works, sometimes not, and this is driving us crazy.
>>
>> When it works, this is what happens:
>> - we have, let's say, 10 jobs running with normal priority (--qos=normal,
>> having final Priority=1001) and a few thousand in PENDING state
>> - we submit a new job with high priority (--qos=high, having final
>> Priority=1001001)
>> - at this point, slurm waits until a normal priority job ends and frees
>> up the required resources, and then starts the new high priority job.
>> That's perfect!
>>
>> However, from time to time, randomly, this does not happen. Here is an
>> example:
>>
>> # the node has around 200GB of memory and 24 CPUs
>> Partition=t1 State=PD Priority=1001001 Nice=0 ID=337455 CPU=24 Memory=80G Nice=0 Started=0:00 User=u1 Submitted=2020-07-07T07:16:47
>> Partition=t1 State=R Priority=1001 Nice=0 ID=337475 CPU=1 Memory=1024M Nice=0 Started=1:22 User=u1 Submitted=2020-07-07T10:31:46
>> Partition=t1 State=R Priority=1001 Nice=0 ID=334355 CPU=1 Memory=1024M Nice=0 Started=58:09 User=u1 Submitted=2020-06-23T09:57:11
>> Partition=t1 State=R Priority=1001 Nice=0 ID=334354 CPU=1 Memory=1024M Nice=0 Started=6:29:59 User=u1 Submitted=2020-06-23T09:57:11
>> Partition=t1 State=R Priority=1001 Nice=0 ID=334353 CPU=1 Memory=1024M Nice=0 Started=13:25:55 User=u1 Submitted=2020-06-23T09:57:11
>> [...]
>>
>> You see? Slurm keeps starting jobs that have a lower priority. Why is that?
>>
>> Some info about our config: Slurm is version 16.05. Here is the priority
>> config of slurm:
>>
>> ##### file /etc/slurm-llnl/slurm.conf
>> PriorityType=priority/multifactor
>> PriorityFavorSmall=NO
>> PriorityWeightQOS=1000000
>> PriorityWeightFairshare=1000
>> PriorityWeightPartition=1000
>> PriorityWeightJobSize=0
>> PriorityWeightAge=0
>>
>> ##### command "sacctmgr show qos"
>> Name     Priority   MaxSubmitPA
>> normal   0          30
>> high     1000
>>
>>
>> Any idea?
>>
>> Thanks
>>
>
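
For reference, the two Priority values in the quoted job list (1001 and 1001001) are consistent with a weighted sum of normalized factors using the PriorityWeight* settings shown above. The sketch below is only illustrative: the function name multifactor_priority is hypothetical, the factor values (partition 1.0, fairshare 0.001, QOS normalized to the highest QOS priority) are assumptions picked to reproduce the numbers in the thread, and terms not relevant here (such as nice) are left out.

```python
# Weights copied from the slurm.conf excerpt quoted in the thread.
WEIGHTS = {
    "qos": 1000000,
    "fairshare": 1000,
    "partition": 1000,
    "jobsize": 0,
    "age": 0,
}


def multifactor_priority(weights: dict, factors: dict) -> int:
    """Weighted sum of factors, each assumed to lie in the range 0.0 to 1.0."""
    total = sum(weights[name] * factors.get(name, 0.0) for name in weights)
    return round(total)


# Assumed factor values, chosen to match the priorities seen in the job list.
highest_qos_priority = 1000  # the "high" QOS from the sacctmgr output
normal_factors = {"qos": 0 / highest_qos_priority, "partition": 1.0, "fairshare": 0.001}
high_factors = {"qos": 1000 / highest_qos_priority, "partition": 1.0, "fairshare": 0.001}

print(multifactor_priority(WEIGHTS, normal_factors))  # normal QOS job
print(multifactor_priority(WEIGHTS, high_factors))    # high QOS job
```

Under these assumptions the QOS term dominates, so the high QOS job always sorts first in the queue; whether it actually starts first still depends on the scheduler, as discussed above.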