[slurm-users] Simultaneously running multiple jobs on same node
Jan van der Laan
slurm at eoos.dds.nl
Tue Nov 24 08:58:49 UTC 2020
Hi Alex,
Thanks a lot. I suspected it was something trivial.
ubuntu at ip-172-31-12-211:~$ scontrol show config | grep -i defmem
DefMemPerNode = UNLIMITED
Specifying `sbatch --mem=1M job.sh` works. I will probably also set a
default value in slurm.conf (just tried; that helps as well).
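For the record, this is the kind of line I mean (the value is only an
example; DefMemPerCPU may fit a shared node better than DefMemPerNode,
since the default then scales with each job's CPU request):

===
# slurm.conf: give jobs a finite default memory request instead of
# letting them claim the whole node; 1000 MB per allocated CPU here
DefMemPerCPU=1000
===

After editing slurm.conf, `scontrol reconfigure` picks up the change.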
Best,
Jan
On 23-11-2020 22:15, Alex Chekholko wrote:
> Hi,
>
> Your job does not request any specific amount of memory, so it gets the
> default request. I believe the default request is all the RAM in the node.
>
> Try something like:
> $ scontrol show config | grep -i defmem
> DefMemPerNode = 64000
>
> Regards,
> Alex
>
>
> On Mon, Nov 23, 2020 at 12:33 PM Jan van der Laan <slurm at eoos.dds.nl
> <mailto:slurm at eoos.dds.nl>> wrote:
>
> Hi,
>
> I am having issues getting slurm to run multiple jobs in parallel on
> the
> same machine.
>
>     Most of our jobs are either (relatively) low on CPU and high on
>     memory (data processing) or low on memory and high on CPU
>     (simulations). The server we have is generally big enough (256 GB
>     memory; 16 cores) to accommodate multiple jobs running at the same
>     time, and we would like to use slurm to schedule these jobs.
>     However, testing on a small (4 CPU) Amazon server, I am unable to
>     get this working. As far as I know, I have to use
>     `SelectType=select/cons_res` and
>     `SelectTypeParameters=CR_CPU_Memory`. However, when I start
>     multiple jobs that each use a single CPU, they run sequentially
>     rather than in parallel.
>
> My `slurm.conf`
>
> ===
> ControlMachine=ip-172-31-37-52
>
> MpiDefault=none
> ProctrackType=proctrack/pgid
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
> SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
> SlurmUser=slurm
> StateSaveLocation=/var/lib/slurm-llnl/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/none
>
> # SCHEDULING
> FastSchedule=1
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU_Memory
>
> # LOGGING AND ACCOUNTING
> AccountingStorageType=accounting_storage/none
> ClusterName=cluster
> JobAcctGatherType=jobacct_gather/none
> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
>
> # COMPUTE NODES
> NodeName=ip-172-31-37-52 CPUs=4 RealMemory=7860 CoresPerSocket=2
> ThreadsPerCore=2 State=UNKNOWN
> PartitionName=test Nodes=ip-172-31-37-52 Default=YES MaxTime=INFINITE
> State=UP
> ====
>
> `job.sh`
> ===
> #!/bin/bash
> sleep 30
> env
> ===
>
> Output when running jobs:
> ===
> ubuntu at ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
> Submitted batch job 2
> ubuntu at ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
> Submitted batch job 3
> ubuntu at ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
> Submitted batch job 4
> ubuntu at ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
> Submitted batch job 5
> ubuntu at ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
> Submitted batch job 6
> ubuntu at ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
> Submitted batch job 7
> ubuntu at ip-172-31-37-52:~$ squeue
>     JOBID PARTITION     NAME    USER ST  TIME  NODES NODELIST(REASON)
>         3      test   job.sh  ubuntu PD  0:00      1 (Resources)
>         4      test   job.sh  ubuntu PD  0:00      1 (Priority)
>         5      test   job.sh  ubuntu PD  0:00      1 (Priority)
>         6      test   job.sh  ubuntu PD  0:00      1 (Priority)
>         7      test   job.sh  ubuntu PD  0:00      1 (Priority)
>         2      test   job.sh  ubuntu  R  0:03      1 ip-172-31-37-52
> ===
>
>     The jobs run sequentially, while in principle it should be
>     possible to run four of them in parallel. I am probably missing
>     something simple. How do I get this to work?
>
> Best,
> Jan
>