[slurm-users] Running multiple jobs on one CPU in parallel

Williams, Gareth (IM&T, Black Mountain) Gareth.Williams at csiro.au
Tue Sep 14 21:42:40 UTC 2021


The simplest approach might be to run multiple processes within each batch job.
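
For example, a minimal sketch of such a batch script (untested; it reuses
/ddos/demo/showproc.sh from your srun example, and the count of 256 per node
is just an arbitrary number above the 128 CPUs):

  #!/bin/bash
  #SBATCH --nodes=1
  #SBATCH --exclusive
  # Start more background processes than the node has CPUs,
  # then wait for all of them to finish.
  for i in $(seq 1 256); do
      /ddos/demo/showproc.sh &
  done
  wait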

Gareth

________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Emre Brookes <emre.brookes at mso.umt.edu>
Sent: Wednesday, September 15, 2021 6:42:24 AM
To: Karl Lovink <karl at lovink.net>; Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Running multiple jobs on one CPU in parallel

Hi Karl,

I haven't tested the MAX_TASKS_PER_NODE limit myself.
According to the slurm.conf documentation:

*MaxTasksPerNode*
    Maximum number of tasks Slurm will allow a job step to spawn on a
    single node.
    The default *MaxTasksPerNode* is 512. May not exceed 65533.

So I'd try setting that in slurm.conf and running "scontrol reconfigure"
before attempting a recompile.
The documentation seems inconsistent on this point: slurm.conf presents
MaxTasksPerNode as a tunable option, while the srun page (quoted below)
says MAX_TASKS_PER_NODE is fixed at build time.
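
For example, a sketch (1024 is an arbitrary value comfortably above the
384 tasks you want):

  # in slurm.conf, on the controller and all nodes:
  MaxTasksPerNode=1024

  # then push the change out without restarting the daemons:
  scontrol reconfigure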

-Emre



Karl Lovink wrote:
> Hi Emre,
>
> MAX_TASKS_PER_NODE is set to 512. Does this mean I cannot run more than
> 512 jobs in parallel on one node? Or can I change MAX_TASKS_PER_NODE to
> a higher value?
> And recompile Slurm...
>
> Regards,
> Karl
>
>
> On 14/09/2021 21:47, Emre Brookes wrote:
>> *-O*, *--overcommit*
>>     Overcommit resources. When applied to job allocation, only one CPU
>>     is allocated to the job per node and options used to specify the
>>     number of tasks per node, socket, core, etc. are ignored. When
>>     applied to job step allocations (the *srun* command when executed
>>     within an existing job allocation), this option can be used to
>>     launch more than one task per CPU. Normally, *srun* will not
>>     allocate more than one process per CPU. By specifying *--overcommit*
>>     you are explicitly allowing more than one process per CPU. However
>>     no more than *MAX_TASKS_PER_NODE* tasks are permitted to execute per
>>     node. NOTE: *MAX_TASKS_PER_NODE* is defined in the file /slurm.h/
>>     and is not a variable, it is set at Slurm build time.
>>
>> I have used this successfully to run more jobs than CPUs/cores available.
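>>
>> For example (untested; just your srun line with --overcommit added):
>>
>>     srun --overcommit --nodes 3 --ntasks 384 /ddos/demo/showproc.sh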
>>
>> -e.
>>
>>
>>
>> Karl Lovink wrote:
>>> Hello,
>>>
>>> I am in the process of setting up our SLURM environment. We want to use
>>> SLURM during our DDoS exercises for dispatching DDoS attack scripts. We
>>> need a lot of jobs running in parallel across a total of 3 nodes. I can't
>>> get it to run more than 128 jobs simultaneously. There are 128 CPUs in
>>> each compute node.
>>>
>>> How can I ensure that I can run more jobs in parallel than there are
>>> CPUs in the compute node?
>>>
>>> Thanks
>>> Karl
>>>
>>>
>>> My srun script is:
>>> srun --exclusive --nodes 3 --ntasks 384 /ddos/demo/showproc.sh
>>>
>>> And my slurm.conf file:
>>> ClusterName=ddos-cluster
>>> ControlMachine=slurm
>>> SlurmUser=ddos
>>> SlurmctldPort=6817
>>> SlurmdPort=6818
>>> AuthType=auth/munge
>>> StateSaveLocation=/opt/slurm/spool/ctld
>>> SlurmdSpoolDir=/opt/slurm/spool/d
>>> SwitchType=switch/none
>>> MpiDefault=none
>>> SlurmctldPidFile=/opt/slurm/run/.pid
>>> SlurmdPidFile=/opt/slurm/run/slurmd.pid
>>> ProctrackType=proctrack/pgid
>>> PluginDir=/opt/slurm/lib/slurm
>>> ReturnToService=2
>>> TaskPlugin=task/none
>>> SlurmctldTimeout=300
>>> SlurmdTimeout=300
>>> InactiveLimit=0
>>> MinJobAge=300
>>> KillWait=30
>>> Waittime=0
>>> SchedulerType=sched/backfill
>>>
>>> SelectType=select/cons_tres
>>> SelectTypeParameters=CR_Core
>>>
>>> SlurmctldDebug=3
>>> SlurmctldLogFile=/opt/slurm/log/slurmctld.log
>>> SlurmdDebug=3
>>> SlurmdLogFile=/opt/slurm/log/slurmd.log
>>> JobCompType=jobcomp/none
>>> JobAcctGatherType=jobacct_gather/none
>>> AccountingStorageTRES=gres/gpu
>>> DebugFlags=CPU_Bind,gres
>>> AccountingStorageType=accounting_storage/slurmdbd
>>> AccountingStorageHost=localhost
>>> AccountingStoragePass=/var/run/munge/munge.socket.2
>>> AccountingStorageUser=slurm
>>> SlurmctldParameters=enable_configurable
>>> GresTypes=gpu
>>> DefMemPerNode=256000
>>> NodeName=aivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16
>>> ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>>> NodeName=mivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16
>>> ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>>> NodeName=fiod CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16
>>> ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>>> PartitionName=ddos Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>> PartitionName=adhoc Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>>
>>>

