Hi,
I am facing an issue in my environment where a batch job and an interactive job end up using the same GPU.
Each server has 2 GPUs. When 2 batch jobs are running, everything works fine and they use the 2 different GPUs, but if one batch job is running and another job is submitted interactively, then both use the same GPU. Is there a way to avoid this?
slurm.conf:
GresTypes=gpu
NodeName=node[01-02] NodeAddr=node[01-02] CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=1 TmpDisk=6000000 RealMemory=515634 Feature=A100 Gres=gpu:2
PartitionName=onprem Nodes=node[01-10] Default=YES MaxTime=21-00:00:00 DefaultTime=3-00:00:00 State=UP Shared=YES OverSubscribe=NO
gres.conf:
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Any suggestions on this?
Regards,
Navin
Navin,
You can isolate GPUs per job if you have cgroups set up properly. What OS are you using? Newer OSes support cgroup v2 out of the box, but if necessary you can continue using v1; this workflow should be applicable to both.
Add ConstrainDevices=yes to your cgroup.conf
This is what the file looks like at my site:
/etc/slurm/cgroup.conf
CgroupMountpoint="/sys/fs/cgroup"
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=yes
You can find the documentation here: https://slurm.schedmd.com/cgroup.conf.html
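One additional note (an assumption on my part, since the slurm.conf excerpt above does not show these settings): ConstrainDevices is enforced by the task/cgroup plugin, so slurm.conf also needs the cgroup task plugin enabled. A minimal sketch, to be adapted to your site:
# enable cgroup-based task containment alongside CPU affinity
TaskPlugin=task/cgroup,task/affinity
# track job processes via cgroups as well (commonly paired with the above)
ProctrackType=proctrack/cgroup
After changing these, the Slurm daemons on the nodes need a restart.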
If you want to share GPUs you can use CUDA MPS or MIG if your GPU supports it.
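For MPS specifically, the Slurm GRES guide describes adding an mps GRES alongside the gpu GRES; a rough sketch only, with illustrative counts:
slurm.conf:
GresTypes=gpu,mps
NodeName=node[01-02] ... Gres=gpu:2,mps:200
gres.conf:
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=mps Count=100 File=/dev/nvidia0
Name=mps Count=100 File=/dev/nvidia1
Jobs would then request a share of a GPU with something like srun --gres=mps:50.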
Regards, Jesse Chintanadilok
Thank you, Jesse.
I am using SLES 15 SP6 Enterprise as the OS. I have not introduced cgroup functionality in my environment. I will think about it and see if that solution works out, but is there any other way to achieve this without cgroups? Batch jobs are fine: 2 jobs, each requesting one GPU, work correctly. It is the mixed case (1 batch job plus 1 interactive job) that creates the problem.
Is there a way I can run a job and apply exclusivity only to the GPU resources?
Regards,
Navin
Well, that's kind of the core issue: without cgroups, _any_ process in the job will have access to all of the GPUs on the system, and there's not much more that Slurm can do about it at that point.
I would have a look at the environment variable CUDA_VISIBLE_DEVICES (https://slurm.schedmd.com/gres.html#GPU_Management). It is set by Slurm and should contain an index (0, 1, 2, etc.) directing applications to the appropriate GPU. I think it's more a case that the batch processes are honoring that variable and the interactive job is not.
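A quick way to confirm this for the interactive case (a sketch; the partition name is taken from the config earlier in the thread):
# start an interactive session that explicitly requests one GPU
srun --partition=onprem --gres=gpu:1 --pty bash
# inside the session, Slurm should have exported the variable
echo $CUDA_VISIBLE_DEVICES    # expect a single index such as 0 or 1
If the interactive session is started without --gres=gpu:1, Slurm allocates no GPU, CUDA_VISIBLE_DEVICES is not set, and (without cgroup device constraints) a CUDA application will typically fall back to device 0, which may already be in use by the batch job.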
- Michael