[slurm-users] Running multiple jobs simultaneously
Matt Jay
mattjay at uw.edu
Thu Sep 26 20:33:52 UTC 2019
Hi Matt,
Check out the "OverSubscribe" partition parameter. Try setting your partition to "OverSubscribe=YES" and then submitting the jobs with the "-oversubscibe" option (or OverSubscribe=FORCE if you want this to happen for all jobs submitted to the partition). Either oversubscribe option can be followed by a colon and the maximum number of jobs that can be assigned to a resource (iirc it defaults to 4 - so you might want to increase to allow the number of jobs you need - ie, maximum number of jobs you need to run simultaneously divided by number of cores available in the partition).
Matt Jay
HPC Systems Engineer - Hyak
Research Computing
University of Washington Information Technology
From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Matt Hohmeister
Sent: Thursday, September 26, 2019 9:14 AM
To: slurm-users at schedmd.com
Subject: [slurm-users] Running multiple jobs simultaneously
I have a two-node cluster running Slurm, and I'm being asked about allowing multiple jobs (hundreds of jobs) to run simultaneously. Below is the scheduling section of my slurm.conf, which I changed to allow multiple jobs to run on each node:
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
For testing purposes, I'm running this job:
#!/bin/bash
#SBATCH --job-name=whatever
#SBATCH --output=slurmBatchLists_Aug19.out
#SBATCH --error=slurmBatchLists_Aug19.err
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --array=70-100
#SBATCH --cpus-per-task=5
matlab -nodisplay -nojvm -r "sampleSlurm($SLURM_ARRAY_TASK_ID);"
...which gives me the following squeue output:
[mhohmeis at odin ~]$ squeue
JOBID          PARTITION  NAME      USER      ST  TIME  NODES  NODELIST(REASON)
1742_[82-100]  debug      whatever  mhohmeis  PD  0:00  1      (Resources)
1755_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1756_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1757_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1758_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1759_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1760_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1761_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1762_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1763_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1742_70        debug      whatever  mhohmeis  R   0:03  1      odin
1742_71        debug      whatever  mhohmeis  R   0:03  1      odin
1742_72        debug      whatever  mhohmeis  R   0:03  1      odin
1742_73        debug      whatever  mhohmeis  R   0:03  1      odin
1742_74        debug      whatever  mhohmeis  R   0:03  1      odin
1742_75        debug      whatever  mhohmeis  R   0:03  1      odin
1742_76        debug      whatever  mhohmeis  R   0:03  1      thor
1742_77        debug      whatever  mhohmeis  R   0:03  1      thor
1742_78        debug      whatever  mhohmeis  R   0:03  1      thor
1742_79        debug      whatever  mhohmeis  R   0:03  1      thor
1742_80        debug      whatever  mhohmeis  R   0:03  1      thor
1742_81        debug      whatever  mhohmeis  R   0:03  1      thor
They're interested in allowing *all* these jobs to run simultaneously. Also, when they add #SBATCH --ntasks=30 to the above .sbatch file, this happens when they try to run it:
[mhohmeis at odin ~]$ squeue
JOBID          PARTITION  NAME      USER      ST  TIME  NODES  NODELIST(REASON)
2052_[70-100]  debug      whatever  mhohmeis  PD  0:00  4      (PartitionConfig)
Any thoughts?
Thanks!
Matt Hohmeister
Systems and Network Administrator
Department of Psychology
Florida State University
PO Box 3064301
Tallahassee, FL 32306-4301
Phone: +1 850 645 1902
Fax: +1 850 644 7739
Pronouns: he/him/his