[slurm-users] Running multiple jobs simultaneously
Matt Jay
mattjay at uw.edu
Thu Sep 26 20:33:52 UTC 2019
Hi Matt,
Check out the "OverSubscribe" partition parameter. Try setting your partition to "OverSubscribe=YES" and then submitting the jobs with the "-oversubscibe" option (or OverSubscribe=FORCE if you want this to happen for all jobs submitted to the partition). Either oversubscribe option can be followed by a colon and the maximum number of jobs that can be assigned to a resource (iirc it defaults to 4 - so you might want to increase to allow the number of jobs you need - ie, maximum number of jobs you need to run simultaneously divided by number of cores available in the partition).
Matt Jay
HPC Systems Engineer - Hyak
Research Computing
University of Washington Information Technology
From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Matt Hohmeister
Sent: Thursday, September 26, 2019 9:14 AM
To: slurm-users at schedmd.com
Subject: [slurm-users] Running multiple jobs simultaneously
I have a two-node cluster running Slurm, and I'm being asked about allowing multiple jobs (hundreds of jobs) to run simultaneously. Below is the scheduling section of my slurm.conf, which I changed to allow multiple jobs to run on each node:
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
For testing purposes, I'm running this job:
#!/bin/bash
#SBATCH --job-name=whatever
#SBATCH --output=slurmBatchLists_Aug19.out
#SBATCH --error=slurmBatchLists_Aug19.err
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --array=70-100
#SBATCH --cpus-per-task=5
matlab -nodisplay -nojvm -r "sampleSlurm($SLURM_ARRAY_TASK_ID);"
...which gives me the following squeue output:
[mhohmeis at odin ~]$ squeue
JOBID          PARTITION  NAME      USER      ST  TIME  NODES  NODELIST(REASON)
1742_[82-100]  debug      whatever  mhohmeis  PD  0:00  1      (Resources)
1755_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1756_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1757_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1758_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1759_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1760_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1761_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1762_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1763_[70-100]  debug      whatever  mhohmeis  PD  0:00  1      (Priority)
1742_70        debug      whatever  mhohmeis  R   0:03  1      odin
1742_71        debug      whatever  mhohmeis  R   0:03  1      odin
1742_72        debug      whatever  mhohmeis  R   0:03  1      odin
1742_73        debug      whatever  mhohmeis  R   0:03  1      odin
1742_74        debug      whatever  mhohmeis  R   0:03  1      odin
1742_75        debug      whatever  mhohmeis  R   0:03  1      odin
1742_76        debug      whatever  mhohmeis  R   0:03  1      thor
1742_77        debug      whatever  mhohmeis  R   0:03  1      thor
1742_78        debug      whatever  mhohmeis  R   0:03  1      thor
1742_79        debug      whatever  mhohmeis  R   0:03  1      thor
1742_80        debug      whatever  mhohmeis  R   0:03  1      thor
1742_81        debug      whatever  mhohmeis  R   0:03  1      thor
They're interested in allowing *all* these jobs to run simultaneously. Also, when they add #SBATCH --ntasks=30 to the above .sbatch file, this happens when they try to run it:
[mhohmeis at odin ~]$ squeue
JOBID          PARTITION  NAME      USER      ST  TIME  NODES  NODELIST(REASON)
2052_[70-100]  debug      whatever  mhohmeis  PD  0:00  4      (PartitionConfig)
Any thoughts?
Thanks!
Matt Hohmeister
Systems and Network Administrator
Department of Psychology
Florida State University
PO Box 3064301
Tallahassee, FL 32306-4301
Phone: +1 850 645 1902
Fax: +1 850 644 7739
Pronouns: he/him/his