[slurm-users] Simultaneously running multiple jobs on same node
Jan van der Laan
slurm at eoos.dds.nl
Mon Nov 23 20:32:00 UTC 2020
Hi,
I am having trouble getting Slurm to run multiple jobs in parallel on the
same machine.

Most of our jobs are either (relatively) low on CPU and high on memory
(data processing) or low on memory and high on CPU (simulations). The
server we have is generally big enough (256 GB of memory; 16 cores) to
accommodate several jobs running at the same time, and we would like to
use Slurm to schedule them. However, when testing on a small (4 CPU)
Amazon server, I am unable to get this working. As far as I know, this
requires `SelectType=select/cons_res` and
`SelectTypeParameters=CR_CPU_Memory`. However, when I start multiple jobs
that each request a single CPU, they are run sequentially rather than in
parallel.
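For reference, the select plugin settings that are actually active and the
CPU/memory resources the controller has registered for the node can be
checked with something like:
===
# show which select plugin and parameters slurmctld is actually using
scontrol show config | grep -i select

# show the CPUs and memory registered for the node
scontrol show node ip-172-31-37-52
===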
My `slurm.conf`
===
ControlMachine=ip-172-31-37-52
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
JobAcctGatherType=jobacct_gather/none
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
# COMPUTE NODES
NodeName=ip-172-31-37-52 CPUs=4 RealMemory=7860 CoresPerSocket=2 ThreadsPerCore=2 State=UNKNOWN
PartitionName=test Nodes=ip-172-31-37-52 Default=YES MaxTime=INFINITE State=UP
===
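One thing I was unsure about: no memory defaults (such as `DefMemPerCPU`)
are set in the configuration above. If that turns out to matter, a
hypothetical addition would be a single line like the one below, where
1900 (MB per CPU) is only an example value:
===
DefMemPerCPU=1900
===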
`job.sh`
===
#!/bin/bash
sleep 30
env
===
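For what it is worth, the same script with explicit per-job resource
requests would look roughly like this (the memory value is only a
placeholder):
===
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=100M   # placeholder; an explicit per-job memory request
sleep 30
env
===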
Output when running jobs:
===
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
Submitted batch job 2
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
Submitted batch job 3
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
Submitted batch job 4
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
Submitted batch job 5
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
Submitted batch job 6
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
Submitted batch job 7
ubuntu@ip-172-31-37-52:~$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      3      test   job.sh   ubuntu PD       0:00      1 (Resources)
      4      test   job.sh   ubuntu PD       0:00      1 (Priority)
      5      test   job.sh   ubuntu PD       0:00      1 (Priority)
      6      test   job.sh   ubuntu PD       0:00      1 (Priority)
      7      test   job.sh   ubuntu PD       0:00      1 (Priority)
      2      test   job.sh   ubuntu  R       0:03      1 ip-172-31-37-52
===
The jobs are run sequentially, while in principle it should be possible
to run 4 jobs in parallel. I am probably missing something simple. How
do I get this to work?
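In case it helps with diagnosing this, the full resource request of one of
the pending jobs can be dumped with:
===
# show the CPUs and memory Slurm believes job 3 requested / was allocated
scontrol show job 3
===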
Best,
Jan