[slurm-users] Running multi jobs on one CPU in parallel

Emre Brookes emre.brookes at mso.umt.edu
Tue Sep 14 19:47:29 UTC 2021


*-O*, *--overcommit*
    Overcommit resources. When applied to job allocation, only one CPU
    is allocated to the job per node and options used to specify the
    number of tasks per node, socket, core, etc. are ignored. When
    applied to job step allocations (the *srun* command when executed
    within an existing job allocation), this option can be used to
    launch more than one task per CPU. Normally, *srun* will not
    allocate more than one process per CPU. By specifying *--overcommit*
    you are explicitly allowing more than one process per CPU. However,
    no more than *MAX_TASKS_PER_NODE* tasks are permitted to execute per
    node. NOTE: *MAX_TASKS_PER_NODE* is defined in the file /slurm.h/
    and is not a variable; it is set at Slurm build time.
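
In practice that means getting an allocation first and then launching the
job step with -O/--overcommit. A rough sketch (the script name is just a
placeholder, and the counts are only an example):

    # request an allocation on 3 nodes
    salloc --nodes=3
    # within the allocation, launch more tasks than allocated CPUs;
    # -O/--overcommit allows more than one task per CPU
    srun --overcommit --ntasks=384 ./your_script.sh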


I have used this successfully to run more jobs than there are CPUs/cores available.
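
For your case that would mean something like:

    # your original command with --overcommit added (and --exclusive
    # dropped); untested sketch, adjust as needed
    srun --overcommit --nodes 3 --ntasks 384 /ddos/demo/showproc.sh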

-e.



Karl Lovink wrote:
> Hello,
>
> I am in the process of setting up our SLURM environment. We want to use
> SLURM during our DDoS exercises for dispatching DDoS attack scripts. We
> need a lot of jobs running in parallel on a total of 3 nodes. I can't get
> it to run more than 128 jobs simultaneously. There are 128 CPUs in the
> compute nodes.
>
> How can I ensure that I can run more jobs in parallel than there are
> CPUs in the compute node?
>
> Thanks
> Karl
>
>
> My srun script is:
> srun --exclusive --nodes 3 --ntasks 384 /ddos/demo/showproc.sh
>
> And my slurm.conf file:
> ClusterName=ddos-cluster
> ControlMachine=slurm
> SlurmUser=ddos
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> StateSaveLocation=/opt/slurm/spool/ctld
> SlurmdSpoolDir=/opt/slurm/spool/d
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/opt/slurm/run/.pid
> SlurmdPidFile=/opt/slurm/run/slurmd.pid
> ProctrackType=proctrack/pgid
> PluginDir=/opt/slurm/lib/slurm
> ReturnToService=2
> TaskPlugin=task/none
> SlurmctldTimeout=300
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=300
> KillWait=30
> Waittime=0
> SchedulerType=sched/backfill
>
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core
>
> SlurmctldDebug=3
> SlurmctldLogFile=/opt/slurm/log/slurmctld.log
> SlurmdDebug=3
> SlurmdLogFile=/opt/slurm/log/slurmd.log
> JobCompType=jobcomp/none
> JobAcctGatherType=jobacct_gather/none
> AccountingStorageTRES=gres/gpu
> DebugFlags=CPU_Bind,gres
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=localhost
> AccountingStoragePass=/var/run/munge/munge.socket.2
> AccountingStorageUser=slurm
> SlurmctldParameters=enable_configurable
> GresTypes=gpu
> DefMemPerNode=256000
> NodeName=aivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16
> ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
> NodeName=mivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16
> ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
> NodeName=fiod CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16
> ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
> PartitionName=ddos Nodes=ALL Default=YES MaxTime=INFINITE State=UP
> PartitionName=adhoc Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
> .
>



