[slurm-users] Running multi jobs on one CPU in parallel
Karl Lovink
karl at lovink.net
Tue Sep 14 20:09:59 UTC 2021
Hi Emre,
MAX_TASKS_PER_NODE is set to 512. Does this mean I cannot run more than
512 jobs in parallel on one node? Or can I change MAX_TASKS_PER_NODE to
a higher value and recompile Slurm?
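If a rebuild is the way to go, I assume the change would look roughly
like this before compiling (untested sketch; the header path and the
new value 1024 are just my guesses):

  # confirm the current compiled-in limit
  scontrol show config | grep -i maxtaskspernode
  # find the define the srun man page mentions, from the top of the source tree
  grep -rn "MAX_TASKS_PER_NODE" slurm/
  # raise it, e.g. 512 -> 1024 (hypothetical value)
  sed -i 's/#define MAX_TASKS_PER_NODE[[:space:]]*512/#define MAX_TASKS_PER_NODE 1024/' slurm/slurm.h*
  # rebuild and reinstall into our prefix
  ./configure --prefix=/opt/slurm && make -j && make install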
Regards,
Karl
On 14/09/2021 21:47, Emre Brookes wrote:
> *-O*, *--overcommit*
> Overcommit resources. When applied to job allocation, only one CPU
> is allocated to the job per node and options used to specify the
> number of tasks per node, socket, core, etc. are ignored. When
> applied to job step allocations (the *srun* command when executed
> within an existing job allocation), this option can be used to
> launch more than one task per CPU. Normally, *srun* will not
> allocate more than one process per CPU. By specifying *--overcommit*
> you are explicitly allowing more than one process per CPU. However
> no more than *MAX_TASKS_PER_NODE* tasks are permitted to execute per
> node. NOTE: *MAX_TASKS_PER_NODE* is defined in the file /slurm.h/
> and is not a variable, it is set at Slurm build time.
>
> I have used this successfully to run more jobs than there are CPUs/cores
> available.
>
> -e.
>
>
>
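If I read that right, something like this should let more tasks than CPUs
run per node (untested on my side; I dropped --exclusive and added
--overcommit, everything else as in my original script):

  srun --overcommit --nodes 3 --ntasks 384 /ddos/demo/showproc.sh
  # -O/--overcommit allows more than one task per CPU, up to
  # MAX_TASKS_PER_NODE (512 here) tasks per node
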
> Karl Lovink wrote:
>> Hello,
>>
>> I am in the process of setting up our SLURM environment. We want to use
>> SLURM during our DDoS exercises for dispatching DDoS attack scripts. We
>> need a lot of jobs running in parallel across a total of 3 nodes. I can't
>> get it to run more than 128 jobs simultaneously. There are 128 CPUs in
>> each compute node.
>>
>> How can I ensure that I can run more jobs in parallel than there are
>> CPUs in the compute node?
>>
>> Thanks
>> Karl
>>
>>
>> My srun script is:
>> srun --exclusive --nodes 3 --ntasks 384 /ddos/demo/showproc.sh
>>
>> And my slurm.conf file:
>> ClusterName=ddos-cluster
>> ControlMachine=slurm
>> SlurmUser=ddos
>> SlurmctldPort=6817
>> SlurmdPort=6818
>> AuthType=auth/munge
>> StateSaveLocation=/opt/slurm/spool/ctld
>> SlurmdSpoolDir=/opt/slurm/spool/d
>> SwitchType=switch/none
>> MpiDefault=none
>> SlurmctldPidFile=/opt/slurm/run/.pid
>> SlurmdPidFile=/opt/slurm/run/slurmd.pid
>> ProctrackType=proctrack/pgid
>> PluginDir=/opt/slurm/lib/slurm
>> ReturnToService=2
>> TaskPlugin=task/none
>> SlurmctldTimeout=300
>> SlurmdTimeout=300
>> InactiveLimit=0
>> MinJobAge=300
>> KillWait=30
>> Waittime=0
>> SchedulerType=sched/backfill
>>
>> SelectType=select/cons_tres
>> SelectTypeParameters=CR_Core
>>
>> SlurmctldDebug=3
>> SlurmctldLogFile=/opt/slurm/log/slurmctld.log
>> SlurmdDebug=3
>> SlurmdLogFile=/opt/slurm/log/slurmd.log
>> JobCompType=jobcomp/none
>> JobAcctGatherType=jobacct_gather/none
>> AccountingStorageTRES=gres/gpu
>> DebugFlags=CPU_Bind,gres
>> AccountingStorageType=accounting_storage/slurmdbd
>> AccountingStorageHost=localhost
>> AccountingStoragePass=/var/run/munge/munge.socket.2
>> AccountingStorageUser=slurm
>> SlurmctldParameters=enable_configurable
>> GresTypes=gpu
>> DefMemPerNode=256000
>> NodeName=aivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16
>> ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>> NodeName=mivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16
>> ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>> NodeName=fiod CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16
>> ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>> PartitionName=ddos Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>> PartitionName=adhoc Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>
>>
>