I have a machine with a quad-core CPU and an Nvidia GPU with 3500+ cores. I want to run around 10 jobs in parallel on the GPU (most are CUDA-based jobs).
PROBLEM: Each job asks for only 100 shards (and usually runs for a minute or so), so I should be able to run 3500/100 = 35 jobs in parallel, but Slurm runs only 4 jobs in parallel and keeps the rest in the queue.
I have this in slurm.conf and gres.conf:
# GPU
GresTypes=gpu,shard

# COMPUTE NODES
PartitionName=pzero Nodes=ALL Default=YES MaxTime=INFINITE State=UP
PartitionName=pgpu Nodes=hostgpu MaxTime=INFINITE State=UP
NodeName=hostgpu NodeAddr=x.x.x.x Gres=gpu:gtx_1080_ti:1,shard:3500 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=64255 State=UNKNOWN

----------------------

Name=gpu Type=gtx_1080_ti File=/dev/nvidia0 Count=1
Name=shard Count=3500 File=/dev/nvidia0
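To double-check that the shard GRES is actually registered on the node, something along these lines should show it (using the node and partition names above):

scontrol show node hostgpu | grep -i gres
sinfo -N -p pgpu -o "%N %c %G"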
Well, if I am reading this right, it makes sense.
Every job will need at least 1 core just to run and if there are only 4 cores on the machine, one would expect a max of 4 jobs to run.
Brian Andrus
Arnuld,
You may be looking for the srun/sbatch "--oversubscribe" parameter, or the matching OverSubscribe configuration option, for CPUs, since CPUs are the limiting factor right now. A rough example is below.
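A minimal sketch of what that could look like, assuming the pgpu partition from your config (with OverSubscribe=YES the jobs also have to request it at submit time; "job.sh" is just a placeholder name):

In slurm.conf:
PartitionName=pgpu Nodes=hostgpu MaxTime=INFINITE State=UP OverSubscribe=YES:10

At submit time:
sbatch --oversubscribe --gres=shard:100 job.sh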
S. Zhang
Every job will need at least 1 core just to run and if there are only 4 cores on the machine, one would expect a max of 4 jobs to run.
I have 3500+ GPU cores available. You mean each GPU job requires at least one CPU? Can't we run a job with just the GPU, without any CPUs? This sbatch script requests only 100 shards; can't we run 35 of them in parallel?
#! /usr/bin/env bash
#SBATCH --output="%j.out"
#SBATCH --error="%j.error"
#SBATCH --partition=pgpu
#SBATCH --gres=shard:100

sleep 10
echo "Current date and time: $(date +"%Y-%m-%d %H:%M:%S")"
echo "Running..."
sleep 10
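For reference, the 4-at-a-time behaviour shows up when submitting a batch of copies of the script and watching the queue, e.g. ("job.sh" being a placeholder name for the script above):

for i in $(seq 35); do sbatch job.sh; done
squeue -u $USER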
Yes, that is how it works: by default it is 1 CPU (core) per job (task). As someone mentioned already, you need to oversubscribe the CPU cores in slurm.conf, meaning up to 10 jobs on each core in your case; a sketch follows below.

Best,
Feng
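A rough idea of that slurm.conf change, assuming the pgpu partition from the original post (FORCE:10 oversubscribes unconditionally, so jobs do not need to ask for it):

PartitionName=pgpu Nodes=hostgpu MaxTime=INFINITE State=UP OverSubscribe=FORCE:10

After editing slurm.conf, something like "scontrol reconfigure" is needed for the new partition setting to take effect.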
On 6/21/24 3:50 am, Arnuld via slurm-users wrote:
I have 3500+ GPU cores available. You mean each GPU job requires at least one CPU? Can't we run a job with just GPU without any CPUs?
No, Slurm has to launch the batch script on compute node cores, and it then has the job of launching the user's application that will run something on the node that will access the GPU(s).
Even with srun directly from a login node, there are still processes that have to run on the compute node, and those need at least a core (some may need more, depending on the application).
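One way to see this is that even a job submitted with nothing but a GRES request still ends up with one CPU allocated, e.g. something along these lines:

sbatch --partition=pgpu --gres=shard:100 --wrap="sleep 60"
scontrol show job <jobid> | grep NumCPUs    # should report NumCPUs=1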
No, Slurm has to launch the batch script on compute node cores ... SNIP ... Even with srun directly from a login node there are still processes that have to run on the compute node and those need at least a core (and some may need more, depending on the application).
Alright, understood.