[slurm-users] Suspend QOS help
Walls, Mitchell
miwalls at siue.edu
Fri Feb 18 15:54:17 UTC 2022
Both jobs would be using the whole node, same as below, but with two nodes. I've reduced the problem space to two isolated partitions on just node04.
NodeName=node04 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257476 Features=cpu
# QOSes have stayed the same.
      Name   Priority    Preempt PreemptMode
---------- ---------- ---------- -----------
   general       1000    suspend     cluster
   suspend        100                cluster
# test partitions
PartitionName=test Default=NO Nodes=cc-cpu-04 OverSubscribe=FORCE:1 MaxTime=30-00:00:00 Qos=general AllowQos=general
PartitionName=suspend Default=NO Nodes=cc-cpu-04 OverSubscribe=FORCE:1 MaxTime=30-00:00:00 Qos=suspend AllowQos=suspend
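As a sanity check, it can help to dump the preemption settings the running cluster is actually using, rather than what the config files say (standard scontrol/sacctmgr invocations; output will vary by site):

scontrol show config | grep -i preempt
sacctmgr show qos format=name,priority,preempt,preemptmode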
# stress-suspend.sh
#!/bin/bash
#SBATCH -p suspend
#SBATCH -C cpu
#SBATCH -q suspend
#SBATCH -c 32
#SBATCH --ntasks-per-node=1
#SBATCH -N 1
stress -c 32 -t $1
# stress.sh
#!/bin/bash
#SBATCH -p test
#SBATCH -C cpu
#SBATCH -q general
#SBATCH -c 32
#SBATCH --ntasks-per-node=1
#SBATCH -N 1
stress -c 32 -t $1
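For reference, this is roughly how I reproduce it (the 600-second duration is arbitrary; the expectation is that submitting the general-QOS job second should push the suspend-QOS job into state S):

sbatch stress-suspend.sh 600
sbatch stress.sh 600
squeue -l
scontrol show job <jobid> | grep -i JobState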
________________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Brian Andrus <toomuchit at gmail.com>
Sent: Friday, February 18, 2022 9:36 AM
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] Suspend QOS help
At first look, I would guess that there are enough resources to satisfy
the requests of both jobs, so there is no need to suspend.
Having the node info and the job info to compare would be the next step.
Brian Andrus
On 2/18/2022 7:20 AM, Walls, Mitchell wrote:
> Hello,
>
> Hoping someone can shed some light on what is causing jobs to run on the same nodes simultaneously, rather than the lower-priority job actually being suspended? I can provide more info if someone can think of something that would help!
>
> # Relevant config.
> PreemptType=preempt/qos
> PreemptMode=SUSPEND,GANG
>
> PartitionName=general Default=YES Nodes=general OverSubscribe=FORCE:1 MaxTime=30-00:00:00 Qos=general AllowQos=general
> PartitionName=suspend Default=NO Nodes=general OverSubscribe=FORCE:1 MaxTime=30-00:00:00 Qos=suspend AllowQos=suspend
>
> # Qoses
>       Name   Priority    Preempt PreemptMode
> ---------- ---------- ---------- -----------
>    general       1000    suspend     cluster
>    suspend        100                cluster
>
> # squeue (another note: in htop I can see both processes actually running at the same time, not being timesliced)
> $ squeue
>  JOBID PARTITION     NAME    USER ST   TIME NODES NODELIST(REASON)
>  45085   general stress.s   user2  R   7:33     2 node[04-05]
>  45084   suspend stress-s   user1  R   7:40     2 node[04-05]
>
> Thanks!