[slurm-users] Simultaneously running multiple jobs on same node

Alex Chekholko alex at calicolabs.com
Mon Nov 23 21:15:25 UTC 2020


Hi,

Your job does not request any specific amount of memory, so it gets the
default request.  With CR_CPU_Memory, if neither DefMemPerNode nor
DefMemPerCPU is set, I believe the default request is all the RAM in the
node, so each job claims the node's entire memory and they serialize.

Try something like:
$ scontrol show config | grep -i defmem
DefMemPerNode           = 64000
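If that turns out to be the problem, one workaround is to request memory
explicitly at submit time, or to set a sane per-CPU default in slurm.conf.
A sketch (the 1G / 1024 MB values are only illustrations, not
recommendations -- size them to your actual jobs):

```shell
# Request an explicit, smaller amount of memory per job so that
# several jobs can fit on the node at once (value is illustrative):
sbatch -n1 -N1 --mem=1G job.sh

# Or set a per-CPU default in slurm.conf so that jobs which do not
# specify --mem no longer claim the whole node (illustrative value;
# restart/reconfigure slurmctld after changing it):
#   DefMemPerCPU=1024
```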

Regards,
Alex


On Mon, Nov 23, 2020 at 12:33 PM Jan van der Laan <slurm at eoos.dds.nl> wrote:

> Hi,
>
> I am having issues getting slurm to run multiple jobs in parallel on the
> same machine.
>
> Most of our jobs are either (relatively) low on CPU and high on memory
> (data processing) or low on memory and high on CPU (simulations). The
> server we have is generally big enough (256GB Mem; 16 cores) to
> accommodate multiple jobs running at the same time, and we would like to use
> slurm to schedule these jobs. However, testing on a small (4 CPU) amazon
> server, I am unable to get this working. I would have to use
> `SelectType=select/cons_res` and `SelectTypeParameters=CR_CPU_Memory` as
> far as I know. However, when I start multiple jobs that each request a
> single CPU, they run sequentially rather than in parallel.
>
> My `slurm.conf`
>
> ===
> ControlMachine=ip-172-31-37-52
>
> MpiDefault=none
> ProctrackType=proctrack/pgid
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
> SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
> SlurmUser=slurm
> StateSaveLocation=/var/lib/slurm-llnl/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/none
>
> # SCHEDULING
> FastSchedule=1
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU_Memory
>
> # LOGGING AND ACCOUNTING
> AccountingStorageType=accounting_storage/none
> ClusterName=cluster
> JobAcctGatherType=jobacct_gather/none
> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
>
> # COMPUTE NODES
> NodeName=ip-172-31-37-52 CPUs=4 RealMemory=7860 CoresPerSocket=2
> ThreadsPerCore=2 State=UNKNOWN
> PartitionName=test Nodes=ip-172-31-37-52 Default=YES MaxTime=INFINITE
> State=UP
> ====
>
> `job.sh`
> ===
> #!/bin/bash
> sleep 30
> env
> ===
>
> Output when running jobs:
> ===
> ubuntu at ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
> Submitted batch job 2
> ubuntu at ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
> Submitted batch job 3
> ubuntu at ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
> Submitted batch job 4
> ubuntu at ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
> Submitted batch job 5
> ubuntu at ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
> Submitted batch job 6
> ubuntu at ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
> Submitted batch job 7
> ubuntu at ip-172-31-37-52:~$ squeue
>               JOBID PARTITION     NAME     USER ST       TIME  NODES
> NODELIST(REASON)
>                   3      test   job.sh   ubuntu PD       0:00      1
> (Resources)
>                   4      test   job.sh   ubuntu PD       0:00      1
> (Priority)
>                   5      test   job.sh   ubuntu PD       0:00      1
> (Priority)
>                   6      test   job.sh   ubuntu PD       0:00      1
> (Priority)
>                   7      test   job.sh   ubuntu PD       0:00      1
> (Priority)
>                   2      test   job.sh   ubuntu  R       0:03      1
> ip-172-31-37-52
> ===
>
> The jobs run sequentially, although in principle it should be possible
> to run 4 jobs in parallel. I am probably missing something simple. How
> do I get this to work?
>
> Best,
> Jan
>
>
