[slurm-users] Job preempts entire host instead of single job

Michał Kadlof michal.kadlof at pw.edu.pl
Tue Jan 17 12:04:03 UTC 2023


Hi,

I am struggling to configure job preemption. I have nodes with 8 Nvidia 
A100 GPUs. I have two partitions: short (lower priority) and sfglab 
(higher priority). I want higher-priority jobs to preempt 
(REQUEUE mode) lower-priority jobs. It appears to work, but it 
works too well.

A job from the higher-priority partition preempts every job on the host 
instead of only the single job that would be enough to free resources 
for it. What's more, it locks the rest of the resources until the 
high-priority job ends. What am I doing wrong?

Here is an example:

$ srun --test-only -G1 -c1 --mem 1M -p sfglab
srun: Job 501151 to start at 2023-01-17T12:46:01 using 1 processors on nodes dgx-1 in partition sfglab
srun:   Preempts: 363278,501001,501029,501075,501076,501077,501120,501121

Preempting just one of these jobs would be enough to free the requested 
resources; there is no need to preempt all of them.
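
To look at the listed jobs one by one, the IDs on the "Preempts:" line 
can be split out. This is only a minimal sketch: preempted_jobs is a 
hypothetical helper name, and it assumes the srun output was captured 
as text first (for example by redirecting with 2>&1, on the assumption 
that srun prints these messages to stderr).

```shell
#!/bin/sh
# Hypothetical helper: read `srun --test-only` output on stdin and print
# the job IDs from the "Preempts:" line, one ID per line.
preempted_jobs() {
    sed -n 's/^srun:[[:space:]]*Preempts:[[:space:]]*//p' | tr ',' '\n'
}

# Sample text copied from the srun output shown above.
sample='srun: Job 501151 to start at 2023-01-17T12:46:01 using 1 processors on nodes dgx-1 in partition sfglab
srun:   Preempts: 363278,501001,501029,501075,501076,501077,501120,501121'

# Print each job that the scheduler reports it would preempt.
printf '%s\n' "$sample" | preempted_jobs
```

Each printed ID can then be passed to "scontrol show job <id>" to see 
which resources that job actually holds on dgx-1.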


Here is my config:

slurm.conf

(...)

DefMemPerCPU            = 100
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
PreemptMode             = REQUEUE
PreemptType             = preempt/partition_prio
PreemptExemptTime       = 00:00:00
SelectType              = select/cons_tres
SelectTypeParameters    = CR_CORE_MEMORY

(...)

PartitionName=short Nodes=dgx-[1-4],sr-[1-3] MaxTime=1-0 State=UP PriorityTier=10000 Default=YES DefaultTime=0-01:00:00 OverSubscribe=NO PreemptMode=requeue

PartitionName=sfglab Nodes=dgx-1 MaxTime=10-0 State=UP PriorityTier=20000 PreemptMode=off OverSubscribe=NO AllowAccounts=sfglab

-- 
best regards | pozdrawiam serdecznie
*Michał Kadlof*
Head of the high performance computing center
Eden^N cluster administrator
Faculty of Mathematics and Computer Science
Warsaw University of Technology