[slurm-users] Suspend without gang scheduling

Mon Aug 8 19:51:49 UTC 2022

Following up with a bit more specific detail on what I’m seeing, as well as a solution that I’m ashamed I didn’t come back to.

If there is exclusively tier3 work queued up, gang scheduling never comes into play.
If there is tier3+tier1 work queued up, tier1 gets requeued, and tier3 preempts as expected.
If enough work is queued in tier3 that it then triggers a suspend preemption in tier2, that’s when things fall over and gang scheduling starts happening inside the tier3 queue.

So the issue seems to have stemmed from my use of OverSubscribe=FORCE:1 in my tier3 partition (separate from the tier1/2 partition).
This was set in anticipation of increasing the forced oversubscription limit in the future, but wanting to keep oversubscription “off” for now.
However, setting OverSubscribe=NO on the tier3 partition, while leaving OverSubscribe=FORCE:1 on the tier1/2 partition, stops the gang scheduling inside tier3.
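
For anyone who finds this later, the relevant slurm.conf lines would look roughly like the sketch below. The partition names (general, priority) and the node range are made up for illustration; only the PreemptType, PreemptMode, and OverSubscribe values come from this thread, and any other partition options are left out.

```
# slurm.conf (fragment). Partition and node names are hypothetical.
PreemptType=preempt/qos
PreemptMode=SUSPEND,GANG

# Partition where tier1/tier2 jobs run: OverSubscribe left at FORCE:1.
PartitionName=general  Nodes=node[01-04] OverSubscribe=FORCE:1

# Partition where tier3 jobs run, on the same nodes: OverSubscribe=NO
# instead of FORCE:1 is the change that stopped gang scheduling in tier3.
PartitionName=priority Nodes=node[01-04] OverSubscribe=NO
```

After editing slurm.conf and reconfiguring, the OverSubscribe field in the output of scontrol show partition should reflect the value in effect for each partition.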

So, this gets me to where I wanted to be in the first place, which is tier3 not gang scheduling, while still allowing tier1/tier2 to be requeued/suspended.
So I answered my own question, and hopefully someone will benefit from this.

Reed

> On Aug 8, 2022, at 11:27 AM, Reed Dier <reed.dier at focusvq.com> wrote:
> 
> I’ve got essentially 3 “tiers” of jobs.
> 
> tier1 are stateless and can be requeued
> tier2 are stateful and can be suspended
> tier3 are “high priority” and can preempt tier1 and tier2 with the requisite preemption modes.
> 
>> $ sacctmgr show qos format=name%10,priority%10,preempt%12,preemptmode%10
>>       Name   Priority      Preempt PreemptMod
>> ---------- ---------- ------------ ----------
>>     normal          0                 cluster
>>      tier1         10                 requeue
>>      tier2         10                 suspend
>>      tier3        100  tier1,tier2    cluster
> 
> I also have a separate partition for the same hardware nodes to allow for tier3 to cross partitions to suspend tier2 (if it’s possible to have this all work in a single partition, please let me know).
> 
> tier1 and tier2 get preempted by tier3 perfectly, but the problem is now that tier3 gets gang scheduled in times of big queues in tier3, when I never want gang scheduling anywhere, but especially not tier3.
> 
>> PreemptType=preempt/qos
>> PreemptMode=SUSPEND,GANG
> 
> This is what is in my slurm.conf, because if I try to set PreemptMode=SUSPEND, the ctld won’t start due to:
>> slurmctld: error: PreemptMode=SUSPEND requires GANG too
> 
> I have also tried to set PreemptMode=OFF in the (tier3) partition as well, but this has had no effect on gang scheduling that I can see.
> 
> Right now, my hit-it-with-a-hammer solution is increasing SchedulerTimeSlice to 65535 (seconds), which should effectively prevent jobs from gang scheduling.
> While this effectively gets me to the goal I’m looking for, it's inelegant, and if I end up with jobs that go past ~18 hours, this is not going to work as I want/hope/expect.
> 
> So I’m hoping that there is a better solution to this that would solve the root issue to have the tier3 qos/partition not preempt itself.
> 
> Hopefully I’ve described this well enough and someone can offer some pointers on how to have suspend-able jobs in tier2, without having incidental gang-suspension in tier3.
> 
> This is 21.08.8-2 in the production cluster, and I’m testing 22.05.2 in my testing cluster which is behaving the same way.
> 
> Reed
