I have a machine with a quad-core CPU and an Nvidia GPU with 3500+ cores. I want to run around 10 jobs in parallel on the GPU (most are CUDA-based jobs).
PROBLEM: Each job asks for only 100 shards (and usually runs for a minute or so), so I should be able to run 3500/100 = 35 jobs in parallel, but Slurm runs only 4 jobs in parallel and keeps the rest in the queue.
I have this in slurm.conf and gres.conf:
# GPU
GresTypes=gpu,shard

# COMPUTE NODES
PartitionName=pzero Nodes=ALL Default=YES MaxTime=INFINITE State=UP
PartitionName=pgpu Nodes=hostgpu MaxTime=INFINITE State=UP
NodeName=hostgpu NodeAddr=x.x.x.x Gres=gpu:gtx_1080_ti:1,shard:3500 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=64255 State=UNKNOWN

----------------------

Name=gpu Type=gtx_1080_ti File=/dev/nvidia0 Count=1
Name=shard Count=3500 File=/dev/nvidia0
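To double-check that the shard GRES is actually registered on the node, something along these lines should show it (using the node and partition names above):

scontrol show node hostgpu | grep -i gres
sinfo -N -p pgpu -o "%N %c %G"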
Well, if I am reading this right, it makes sense.
Every job will need at least 1 core just to run and if there are only 4 cores on the machine, one would expect a max of 4 jobs to run.
Brian Andrus
Arnuld,
You may be looking for the srun/sbatch "--oversubscribe" parameter, or the matching OverSubscribe configuration option, for CPUs, since CPUs are the limiting factor right now. A rough example is below.
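A minimal sketch of what that could look like, assuming the pgpu partition from your config (with OverSubscribe=YES the jobs also have to request it at submit time; "job.sh" is just a placeholder name):

In slurm.conf:
PartitionName=pgpu Nodes=hostgpu MaxTime=INFINITE State=UP OverSubscribe=YES:10

At submit time:
sbatch --oversubscribe --gres=shard:100 job.sh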
S. Zhang
Every job will need at least 1 core just to run and if there are only 4 cores on the machine, one would expect a max of 4 jobs to run.
I have 3500+ GPU cores available. You mean each GPU job requires at least one CPU? Can't we run a job with just the GPU, without any CPUs? This sbatch script requests only 100 shards; can't we run 35 of them in parallel?
#! /usr/bin/env bash
#SBATCH --output="%j.out"
#SBATCH --error="%j.error"
#SBATCH --partition=pgpu
#SBATCH --gres=shard:100

sleep 10
echo "Current date and time: $(date +"%Y-%m-%d %H:%M:%S")"
echo "Running..."
sleep 10
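For reference, the 4-at-a-time behaviour shows up when submitting a batch of copies of the script and watching the queue, e.g. ("job.sh" being a placeholder name for the script above):

for i in $(seq 35); do sbatch job.sh; done
squeue -u $USER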
Yes, that is how it works: by default it is 1 CPU (core) per job (task). As someone mentioned already, you need to oversubscribe the CPU cores in slurm.conf, meaning up to 10 jobs on each core in your case; a sketch follows below.

Best,
Feng
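A rough idea of that slurm.conf change, assuming the pgpu partition from the original post (FORCE:10 oversubscribes unconditionally, so jobs do not need to ask for it):

PartitionName=pgpu Nodes=hostgpu MaxTime=INFINITE State=UP OverSubscribe=FORCE:10

After editing slurm.conf, something like "scontrol reconfigure" is needed for the new partition setting to take effect.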
On 6/21/24 3:50 am, Arnuld via slurm-users wrote:
I have 3500+ GPU cores available. You mean each GPU job requires at least one CPU? Can't we run a job with just GPU without any CPUs?
No, Slurm has to launch the batch script on compute node cores, and it then has the job of launching the user's application that will run something on the node that will access the GPU(s).
Even with srun directly from a login node, there are still processes that have to run on the compute node, and those need at least a core (some may need more, depending on the application).
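One way to see this is that even a job submitted with nothing but a GRES request still ends up with one CPU allocated, e.g. something along these lines:

sbatch --partition=pgpu --gres=shard:100 --wrap="sleep 60"
scontrol show job <jobid> | grep NumCPUs    # should report NumCPUs=1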
No, Slurm has to launch the batch script on compute node cores ... SNIP ... Even with srun directly from a login node there are still processes that have to run on the compute node and those need at least a core (and some may need more, depending on the application).
Alright, understood.