[slurm-users] Running 2 jobs on one node uses the same cores, 300x slowdown

Anne Hammond hammond at txcorp.com
Wed Nov 24 00:33:00 UTC 2021


We are running Slurm 20.11.2-1 from the CentOS 7 RPMs.

The queue is set up to allow OverSubscribe:

NodeName=ne[04-09] CPUs=32 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
PartitionName=neon-noSMT Nodes=ne[04-09] Default=NO MaxTime=3-00:00:00 DefaultTime=4:00:00 State=UP OverSubscribe=YES
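
For reference, the live settings can be confirmed with scontrol (standard commands, shown here with our partition name and one of our node names):

scontrol show partition neon-noSMT
scontrol show node ne04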

I requested a user submit the first job:

#SBATCH --partition=neon-noSMT
#SBATCH --job-name="ns072"
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=24:00:00
#SBATCH --exclusive
#SBATCH --error=ns072.err
#SBATCH --output=ns072.out
#SBATCH --mail-type=ALL  # NONE, BEGIN, END, FAIL, REQUEUE, ALL
#SBATCH --mail-user=user at corp.com

I requested the user submit the second job using the same SBATCH
commands as above, but adding:

#SBATCH --oversubscribe

and the command to run the second job on the same node
as the first job:

sbatch --nodelist={node running first job} run.sbatch
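
One way I know to check what Slurm actually allocated to each job is scontrol with the details flag (<jobid> is a placeholder for the real job ID; I believe the per-node CPU_IDs list is only reported when core-level allocation is in effect):

scontrol --details show job <jobid>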

Note that each job uses only 8 tasks/cores of the 32 available on the node.

When the user submits the second job, the first job slows
down by a factor of roughly 300.

If I log in to the node running the two jobs, top shows only
8 cores/tasks busy, not 8 for each job; both jobs appear to be
sharing the same 8 cores.
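
A quick way to double-check core placement is the PSR column in ps, which shows the core each process last ran on (<binary> is a placeholder for the solver executable name):

ps -C <binary> -o pid,psr,comm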

These are the SCHEDULING parameters from /etc/slurm/slurm.conf:

# SCHEDULING
# out 29Dec20 
#FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
SelectTypeParameters=CR_ONE_TASK_PER_CORE
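
From my reading of the slurm.conf man page, select/linear allocates whole nodes and does not track individual cores, so with OverSubscribe=YES two jobs can end up bound to the same cores. If core-level scheduling is what we need, I believe the configuration would look something like this (an untested sketch on my part, not our current config):

SelectType=select/cons_tres
SelectTypeParameters=CR_Core
TaskPlugin=task/affinity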

Is there a different parameter I should be looking at?

Thanks in advance,
Anne Hammond



