[slurm-users] Running multi jobs on one CPU in parallel
Emre Brookes
emre.brookes at mso.umt.edu
Tue Sep 14 20:42:24 UTC 2021
Hi Karl,
I haven't tested the MAX_TASKS_PER_NODE limits.
According to the slurm.conf man page:
*MaxTasksPerNode*
Maximum number of tasks Slurm will allow a job step to spawn on a
single node.
The default *MaxTasksPerNode* is 512. May not exceed 65533.
So I'd try setting that in slurm.conf and running "scontrol reconfigure"
before attempting a recompile.
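The documentation seems inconsistent on this point: the --overcommit
text quoted below says MAX_TASKS_PER_NODE is fixed at build time, while
slurm.conf documents MaxTasksPerNode as a configurable parameter.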
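
For what it's worth, here is a rough sketch of the arithmetic behind that
limit: it works out how many tasks the busiest node would have to host and
compares that with the per-node limit. The 512 default comes from the man
page text above; the task counts 1536 and 3000 and the helper names
tasks_per_node and fits_limit are made up for illustration, and the srun
line is only printed, not run.

```shell
#!/bin/sh
# Sketch only: checks whether a planned job step fits under MaxTasksPerNode.
# LIMIT is the documented default (512). On a real cluster the live value
# can be read with: scontrol show config | grep -i MaxTasksPerNode

# Ceiling of tasks/nodes: the most tasks any single node must host.
tasks_per_node() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Exit status 0 when the per-node task count is within the limit ($3).
fits_limit() {
    [ "$(tasks_per_node "$1" "$2")" -le "$3" ]
}

LIMIT=512
NODES=3
for ntasks in 384 1536 3000; do
    per_node=$(tasks_per_node "$ntasks" "$NODES")
    if fits_limit "$ntasks" "$NODES" "$LIMIT"; then
        echo "$ntasks tasks: $per_node per node, within $LIMIT"
        # Printed, not executed: the overcommit form of the original command.
        echo "  srun --overcommit --nodes $NODES --ntasks $ntasks /ddos/demo/showproc.sh"
    else
        echo "$ntasks tasks: $per_node per node, exceeds $LIMIT"
    fi
done
```

If a count comes out over the limit, the thing to try would be a line such
as MaxTasksPerNode=1024 in slurm.conf followed by "scontrol reconfigure".
I haven't verified that a reconfigure alone is enough; the daemons may need
a restart.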
-Emre
Karl Lovink wrote:
> Hi Emre,
>
> MAX_TASKS_PER_NODE is set to 512. Does this mean I cannot run more than
> 512 jobs in parallel on one node? Or can I change MAX_TASKS_PER_NODE to
> a higher value and recompile Slurm?
>
> Regards,
> Karl
>
>
> On 14/09/2021 21:47, Emre Brookes wrote:
>> *-O*, *--overcommit*
>> Overcommit resources. When applied to job allocation, only one CPU
>> is allocated to the job per node and options used to specify the
>> number of tasks per node, socket, core, etc. are ignored. When
>> applied to job step allocations (the *srun* command when executed
>> within an existing job allocation), this option can be used to
>> launch more than one task per CPU. Normally, *srun* will not
>> allocate more than one process per CPU. By specifying *--overcommit*
>> you are explicitly allowing more than one process per CPU. However
>> no more than *MAX_TASKS_PER_NODE* tasks are permitted to execute per
>> node. NOTE: *MAX_TASKS_PER_NODE* is defined in the file /slurm.h/
>> and is not a variable, it is set at Slurm build time.
>>
>> I have used this successfully to run more jobs than there are
>> CPUs/cores available.
>>
>> -e.
>>
>>
>>
>> Karl Lovink wrote:
>>> Hello,
>>>
>>> I am in the process of setting up our SLURM environment. We want to use
>>> SLURM during our DDoS exercises for dispatching DDoS attack scripts. We
>>> need a lot of parallel running jobs on a total of 3 nodes. I can't get it
>>> to run more than 128 jobs simultaneously. There are 128 CPUs in each of
>>> the compute nodes.
>>>
>>> How can I ensure that I can run more jobs in parallel than there are
>>> CPUs in the compute node?
>>>
>>> Thanks
>>> Karl
>>>
>>>
>>> My srun script is:
>>> srun --exclusive --nodes 3 --ntasks 384 /ddos/demo/showproc.sh
>>>
>>> And my slurm.conf file:
>>> ClusterName=ddos-cluster
>>> ControlMachine=slurm
>>> SlurmUser=ddos
>>> SlurmctldPort=6817
>>> SlurmdPort=6818
>>> AuthType=auth/munge
>>> StateSaveLocation=/opt/slurm/spool/ctld
>>> SlurmdSpoolDir=/opt/slurm/spool/d
>>> SwitchType=switch/none
>>> MpiDefault=none
>>> SlurmctldPidFile=/opt/slurm/run/.pid
>>> SlurmdPidFile=/opt/slurm/run/slurmd.pid
>>> ProctrackType=proctrack/pgid
>>> PluginDir=/opt/slurm/lib/slurm
>>> ReturnToService=2
>>> TaskPlugin=task/none
>>> SlurmctldTimeout=300
>>> SlurmdTimeout=300
>>> InactiveLimit=0
>>> MinJobAge=300
>>> KillWait=30
>>> Waittime=0
>>> SchedulerType=sched/backfill
>>>
>>> SelectType=select/cons_tres
>>> SelectTypeParameters=CR_Core
>>>
>>> SlurmctldDebug=3
>>> SlurmctldLogFile=/opt/slurm/log/slurmctld.log
>>> SlurmdDebug=3
>>> SlurmdLogFile=/opt/slurm/log/slurmd.log
>>> JobCompType=jobcomp/none
>>> JobAcctGatherType=jobacct_gather/none
>>> AccountingStorageTRES=gres/gpu
>>> DebugFlags=CPU_Bind,gres
>>> AccountingStorageType=accounting_storage/slurmdbd
>>> AccountingStorageHost=localhost
>>> AccountingStoragePass=/var/run/munge/munge.socket.2
>>> AccountingStorageUser=slurm
>>> SlurmctldParameters=enable_configurable
>>> GresTypes=gpu
>>> DefMemPerNode=256000
>>> NodeName=aivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>>> NodeName=mivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>>> NodeName=fiod CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>>> PartitionName=ddos Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>> PartitionName=adhoc Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>>