[slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

Fri Jan 19 15:59:28 UTC 2024

If you run "scontrol show jobid <jobid>" on your pending job with the "(Resources)" tag, the output may tell you more about what is unavailable to the job.  Depending on its configuration, Slurm can "allocate" an entire compute node to a running job regardless of whether the job needs all of it, so you may need to alter one or both of the following settings to allow more than one job to run on a single node at once.  You'll find these in your slurm.conf.  If you do end up making changes, don't forget to run "scontrol reconfigure", and you may also need to restart "slurmctld" and the "slurmd" daemons on your nodes.

SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
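
To see which value is actually in effect before changing anything, something like the following may help. It is only a sketch: check_select_type is a made-up helper name, and it assumes its input is either slurm.conf itself or the "Key = Value" lines printed by "scontrol show config".

```shell
#!/bin/bash
# Hypothetical helper: report the SelectType found on stdin.
# Accepts "Key=Value" (slurm.conf) or "Key = Value" (scontrol show config).
check_select_type() {
    awk -F'=' '
        {
            key = $1
            gsub(/[ \t]/, "", key)      # strip padding around the key
        }
        tolower(key) == "selecttype" {
            val = $2
            gsub(/[ \t]/, "", val)      # strip padding around the value
            found = 1
            if (val == "select/cons_tres")
                print "per-resource: " val
            else
                print "check: " val " may allocate whole nodes"
        }
        END { if (!found) print "SelectType not found in input" }
    '
}

# Usage on a cluster (the slurm.conf path depends on your install):
#   scontrol show config | check_select_type
#   check_select_type < /etc/slurm/slurm.conf
```

Commented-out lines such as "#SelectType=..." are ignored, because the key then no longer matches.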

I hope this helps.

Kind regards,
Jason

----

Jason Macklin

Manager Cyberinfrastructure, Research Cyberinfrastructure

860.837.2142 t | 860.202.7779 m

jason.macklin at jax.org

The Jackson Laboratory

Maine | Connecticut | California | Shanghai

www.jax.org

The Jackson Laboratory: Leading the search for tomorrow's cures

________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of slurm-users-request at lists.schedmd.com <slurm-users-request at lists.schedmd.com>
Sent: Friday, January 19, 2024 9:24 AM
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: [EXTERNAL]slurm-users Digest, Vol 75, Issue 31

Today's Topics:

   1. Re: Need help with running multiple instances/executions of a
      batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)
      (Marko Markoc)
   2. Re: Need help with running multiple instances/executions of a
      batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)
      (Ümit Seren)

----------------------------------------------------------------------

Message: 1
Date: Fri, 19 Jan 2024 06:12:24 -0800
From: Marko Markoc <mmarkoc at pdx.edu>
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Need help with running multiple
        instances/executions of a batch script in parallel (with NVIDIA HGX
        A100 GPU as a Gres)
Message-ID:
        <CABnuMe4JTA0e6=VbO8D+To=8FGO+3Byv1dK_MC+OuRitzN5dXg at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

+1 on checking the memory allocation.
Or add/check if you have any DefMemPerX set in your slurm.conf
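
For the per-job side of this, a rough sketch of a job script that asks for one GPU together with an explicit amount of memory and CPUs is below. As far as I know, when memory is a consumable resource (CR_Core_Memory) and no DefMemPerCPU/DefMemPerNode/DefMemPerGPU default is set, a job that does not request memory can end up holding all of the node's memory. The helper name make_gpu_job and the default values (1 GPU, 16G, 4 CPUs) are illustrative assumptions, not values taken from this cluster.

```shell
#!/bin/bash
# Hypothetical helper: print a job script that requests a bounded share
# of the node. Arguments: GPUs per node, memory, CPUs per task.
make_gpu_job() {
    local gpus="${1:-1}" mem="${2:-16G}" cpus="${3:-4}"
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=gpu-job
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=${cpus}
#SBATCH --gpus-per-node=${gpus}
#SBATCH --mem=${mem}
#SBATCH --output=gpu_job_output.%j
#SBATCH --error=gpu_job_error.%j

hostname
date
sleep 40
EOF
}

# Usage (requires a Slurm cluster):
#   make_gpu_job 1 16G 4 > gpu-job.sh
#   for i in 1 2 3; do sbatch gpu-job.sh; done
#   squeue
```

If the three jobs still do not run side by side with these limits, the pending job's "scontrol show job" output is the next place to look.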

On Fri, Jan 19, 2024 at 12:33 AM mohammed shambakey <shambakey1 at gmail.com>
wrote:

> Hi
>
> I'm not an expert, but is it possible that the currently running job is
> consuming the whole node because it has been allocated all of the node's
> memory (so the other 2 jobs have to wait until it finishes)?
> Maybe try restricting the memory requested by each job?
>
> Regards
>
> On Thu, Jan 18, 2024 at 4:46 PM Ümit Seren <uemit.seren at gmail.com> wrote:
>
>> This line also has to be changed:
>>
>>
>> #SBATCH --gpus-per-node=4 -> #SBATCH --gpus-per-node=1
>>
>> --gpus-per-node seems to be the new parameter that is replacing the --gres=
>> one, so you can remove the --gres line completely.
>>
>>
>>
>> Best
>>
>> Ümit
>>
>>
>>
>> *From: *slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
>> Kherfani, Hafedh (Professional Services, TC) <hafedh.kherfani at hpe.com>
>> *Date: *Thursday, 18. January 2024 at 15:40
>> *To: *Slurm User Community List <slurm-users at lists.schedmd.com>
>> *Subject: *Re: [slurm-users] Need help with running multiple
>> instances/executions of a batch script in parallel (with NVIDIA HGX A100
>> GPU as a Gres)
>>
>> Hi Noam and Matthias,
>>
>>
>>
>> Thanks both for your answers.
>>
>>
>>
>> I changed the "#SBATCH --gres=gpu:4" directive (in the batch script) to
>> "#SBATCH --gres=gpu:1" as you suggested, but it didn't make a difference:
>> running this batch script 3 times results in the first job being in
>> a running state, while the second and third jobs remain in a pending
>> state.
>>
>>
>>
>> [slurmtest@c-a100-master test-batch-scripts]$ cat gpu-job.sh
>> #!/bin/bash
>> #SBATCH --job-name=gpu-job
>> #SBATCH --partition=gpu
>> #SBATCH --nodes=1
>> #SBATCH --gpus-per-node=4
>> #SBATCH --gres=gpu:1                            # <<<< Changed from "4" to "1"
>> #SBATCH --tasks-per-node=1
>> #SBATCH --output=gpu_job_output.%j
>> #SBATCH --error=gpu_job_error.%j
>>
>> hostname
>> date
>> sleep 40
>> pwd
>>
>> [slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
>> Submitted batch job 217
>> [slurmtest@c-a100-master test-batch-scripts]$ squeue
>>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>>                217       gpu  gpu-job slurmtes  R       0:02      1 c-a100-cn01
>> [slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
>> Submitted batch job 218
>> [slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
>> Submitted batch job 219
>> [slurmtest@c-a100-master test-batch-scripts]$ squeue
>>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>>                219       gpu  gpu-job slurmtes PD       0:00      1 (Priority)
>>                218       gpu  gpu-job slurmtes PD       0:00      1 (Resources)
>>                217       gpu  gpu-job slurmtes  R       0:07      1 c-a100-cn01
>>
>>
>>
>> Basically, I'm looking for help/hints on how to tell Slurm, from the
>> batch script for example, "I want only 1 or 2 GPUs to be used/consumed by
>> the job", then run the batch script/job a couple of times with the sbatch
>> command and confirm that multiple jobs can indeed each use a GPU and
>> run in parallel, at the same time.
>>
>>
>>
>> Does that make sense?
>>
>>
>>
>>
>>
>> Best regards,
>>
>>
>>
>> *Hafedh *
>>
>>
>>
>> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf
>> Of *Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
>> *Sent:* Thursday, January 18, 2024 2:30 PM
>> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
>> *Subject:* Re: [slurm-users] Need help with running multiple
>> instances/executions of a batch script in parallel (with NVIDIA HGX A100
>> GPU as a Gres)
>>
>>
>>
>> On Jan 18, 2024, at 7:31 AM, Matthias Loose <m.loose at mindcode.de> wrote:
>>
>>
>>
>> Hi Hafedh,
>>
>> I'm no expert in the GPU side of SLURM, but looking at your current
>> configuration, to me it's working as intended at the moment. You have defined
>> 4 GPUs and start multiple jobs, each consuming 4 GPUs. So the jobs wait
>> for the resource to be free again.
>>
>> I think what you need to look into is the MPS plugin, which seems to do
>> what you are trying to achieve:
>> https://slurm.schedmd.com/gres.html#MPS_Management
>>
>>
>>
>> I agree with the first paragraph.  How many GPUs are you expecting each
>> job to use? I'd have assumed, based on the original text, that each job is
>> supposed to use 1 GPU, and the 4 jobs were supposed to be running
>> side-by-side on the one node you have (with 4 GPUs).  If so, you need to
>> tell each job to request only 1 GPU, and currently each one is requesting 4.
>>
>>
>>
>> If your jobs are actually supposed to be using 4 GPUs each, I still don't
>> see any advantage to MPS (at least in what is my usual GPU usage pattern):
>> all the jobs will take longer to finish, because they are sharing the fixed
>> resource. If they take turns, at least the first ones finish as fast as
>> they can, and the last one will finish no later than it would have if they
>> were all time-sharing the GPUs.  I guess NVIDIA had something in mind when
>> they developed MPS, so I guess our pattern may not be typical (or at least
>> not universal), and in that case the MPS plugin may well be what you need.
>>
>
>
> --
> Mohammed
>

------------------------------

Message: 2
Date: Fri, 19 Jan 2024 15:24:17 +0100
From: Ümit Seren <uemit.seren at gmail.com>
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Need help with running multiple
        instances/executions of a batch script in parallel (with NVIDIA HGX
        A100 GPU as a Gres)
Message-ID:
        <CANBYW4ACFtNwawVc8WqcGXgBOAq6v_eeHHX9mXGdgbUs_D=EyA at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Maybe also post the output of scontrol show job <jobid> to check the other
resources allocated for the job.
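
The full output is long, so a small filter that keeps only the resource-related fields can make it easier to post. This is a sketch under assumptions: job_resources is a made-up name, and the field names (NumNodes, NumCPUs, MinMemoryNode/MinMemoryCPU, TRES or ReqTRES/AllocTRES, TresPerNode) are the ones I would expect to see; they can differ between Slurm versions.

```shell
#!/bin/bash
# Hypothetical helper: keep only the resource-related Key=Value fields
# from `scontrol show job <jobid>` output read on stdin.
job_resources() {
    # Put each Key=Value token on its own line, then filter by key.
    tr ' ' '\n' |
        grep -E '^(NumNodes|NumCPUs|MinMemoryNode|MinMemoryCPU|TRES|ReqTRES|AllocTRES|TresPerNode)='
}

# Usage on a cluster, for the running and the pending job:
#   scontrol show job 217 | job_resources
#   scontrol show job 218 | job_resources
```

Comparing the running job with the pending one should show whether CPUs, memory or GPUs are what the first job is holding.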

On Thu, Jan 18, 2024, 19:22 Kherfani, Hafedh (Professional Services, TC) <
hafedh.kherfani at hpe.com> wrote:

> Hi Ümit, Troy,
>
>
>
> I removed the line "#SBATCH --gres=gpu:1" and changed the sbatch
> directive "--gpus-per-node=4" to "--gpus-per-node=1", but I am still getting the
> same result: when running multiple sbatch commands for the same script,
> only one job (the first execution) runs, and all subsequent jobs are in a
> pending state (the REASON is reported as "Resources" for the next
> job in the queue and "Priority" for the remaining ones).
>
>
>
> As for the output of the "scontrol show job <jobid>" command: I don't see a "TRES"
> field on its own. I see the field "TresPerNode=gres/gpu:1" (the value at
> the end of the line corresponds to the value specified in the "--gpus-per-node="
> directive).
>
>
>
> PS: Is it normal/expected (in the output of the scontrol show job command) to
> have "Features=(null)"? I was expecting to see Features=gpu.
>
>
>
>
>
> Best regards,
>
>
>
> *Hafedh *
>
>
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf Of
> *Baer, Troy
> *Sent:* Thursday, January 18, 2024 3:47 PM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] Need help with running multiple
> instances/executions of a batch script in parallel (with NVIDIA HGX A100
> GPU as a Gres)
>
>
>
> Hi Hafedh,
>
>
>
> Your job script has the sbatch directive "--gpus-per-node=4" set.  I
> suspect that if you look at what's allocated to the running job by doing
> "scontrol show job <jobid>" and looking at the TRES field, it's been
> allocated 4 GPUs instead of one.
>
>
>
> Regards,
>
>                 --Troy
>
>
>
>

End of slurm-users Digest, Vol 75, Issue 31
*******************************************