[slurm-users] [EXT] GPU Jobs with Slurm

Fulcomer, Samuel samuel_fulcomer at brown.edu
Thu Jan 14 23:36:44 UTC 2021


AllowedDevicesFile should not be necessary. The relevant devices are
identified in gres.conf. "ConstrainDevices=yes" should be all that's needed.
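
For reference, a minimal sketch of the two files (the device paths and GPU
counts below are illustrative only, not taken from your attached configs):

# cgroup.conf -- note there is no AllowedDevicesFile line
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes

# gres.conf -- one line per GPU type, bracket ranges for the device files
Name=gpu Type=p100 File=/dev/nvidia[0-3]

The Type= names in gres.conf need to match the Gres= entries for the node
in slurm.conf so that --gres=gpu:p100:2 style requests resolve correctly.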

Inside the job, nvidia-smi will see only the allocated GPUs. Note that a
single allocated GPU will always be reported by nvidia-smi as GPU 0,
regardless of its actual hardware ordinal, and GPU_DEVICE_ORDINAL will be
set to 0. SLURM_STEP_GPUS will be set to the actual device number (N, where
the device is /dev/nvidiaN).
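
A quick sanity check of the constraint (the flags here are just an example)
is to request a single GPU and print both values:

srun --gres=gpu:1 --partition=gpu bash -c 'echo $SLURM_STEP_GPUS; nvidia-smi -L'

With ConstrainDevices=yes working, nvidia-smi -L should list exactly one
GPU (shown as GPU 0), while SLURM_STEP_GPUS reports the real /dev/nvidiaN
number.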

On Thu, Jan 14, 2021 at 6:20 PM Ryan Novosielski <novosirj at rutgers.edu>
wrote:

> AFAIK, if you have this set up correctly, nvidia-smi will be restricted
> too, though I think we were seeing a bug there at one time in this version.
>
> --
> #BlackLivesMatter
> ____
> || \\UTGERS,
> |---------------------------*O*---------------------------
> ||_// the State     |         Ryan Novosielski - novosirj at rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ     | Office of Advanced Research Computing - MSB C630,
> Newark
>     `'
>
> On Jan 14, 2021, at 18:05, Abhiram Chintangal <achintangal at berkeley.edu>
> wrote:
>
> 
> Sean,
>
> Thanks for the clarification. I noticed that I am missing the
> "AllowedDevices" option in mine. After adding this, the GPU allocations
> started working. (Slurm version 18.08.8)
>
> I was also incorrectly using "nvidia-smi" as a check.
>
> Regards,
>
> Abhiram
>
> On Thu, Jan 14, 2021 at 12:22 AM Sean Crosby <scrosby at unimelb.edu.au>
> wrote:
>
>> Hi Abhiram,
>>
>> You need to configure cgroup.conf to constrain the devices a job has
>> access to. See https://slurm.schedmd.com/cgroup.conf.html
>>
>> My cgroup.conf is
>>
>> CgroupAutomount=yes
>> AllowedDevicesFile="/usr/local/slurm/etc/cgroup_allowed_devices_file.conf"
>>
>> ConstrainCores=yes
>> ConstrainRAMSpace=yes
>> ConstrainSwapSpace=yes
>> ConstrainDevices=yes
>>
>> TaskAffinity=no
>>
>> CgroupMountpoint=/sys/fs/cgroup
>>
>> The ConstrainDevices=yes is the key to stopping jobs from having access
>> to GPUs they didn't request.
>>
>> Sean
>>
>> --
>> Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
>> Research Computing Services | Business Services
>> The University of Melbourne, Victoria 3010 Australia
>>
>>
>>
>> On Thu, 14 Jan 2021 at 18:36, Abhiram Chintangal <
>> achintangal at berkeley.edu> wrote:
>>
>>> Hello,
>>>
>>> I recently set up a small cluster at work using Warewulf/Slurm.
>>> Currently, I am not able to get the scheduler to
>>> work well with GPUs (Gres).
>>>
>>> While Slurm is able to filter by GPU type, it allocates all the GPUs on
>>> the node. See below:
>>>
>>> [abhiram at whale ~]$ srun --gres=gpu:p100:2 -n 1 --partition=gpu
>>>> nvidia-smi --query-gpu=index,name --format=csv
>>>> index, name
>>>> 0, Tesla P100-PCIE-16GB
>>>> 1, Tesla P100-PCIE-16GB
>>>> 2, Tesla P100-PCIE-16GB
>>>> 3, Tesla P100-PCIE-16GB
>>>> [abhiram at whale ~]$ srun --gres=gpu:titanrtx:2 -n 1 --partition=gpu
>>>> nvidia-smi --query-gpu=index,name --format=csv
>>>> index, name
>>>> 0, TITAN RTX
>>>> 1, TITAN RTX
>>>> 2, TITAN RTX
>>>> 3, TITAN RTX
>>>> 4, TITAN RTX
>>>> 5, TITAN RTX
>>>> 6, TITAN RTX
>>>> 7, TITAN RTX
>>>>
>>>
>>> I am fairly new to Slurm and still figuring out my way around it. I
>>> would really appreciate any help with this.
>>>
>>> For your reference, I attached the slurm.conf and gres.conf files.
>>>
>>> Best,
>>>
>>> Abhiram
>>>
>>> --
>>>
>>> Abhiram Chintangal
>>> QB3 Nogales Lab
>>> Bioinformatics Specialist @ Howard Hughes Medical Institute
>>> University of California Berkeley
>>> 708D Stanley Hall, Berkeley, CA 94720
>>> Phone (510)666-3344
>>>
>>>
>
> --
>
> Abhiram Chintangal
> QB3 Nogales Lab
> Bioinformatics Specialist @ Howard Hughes Medical Institute
> University of California Berkeley
> 708D Stanley Hall, Berkeley, CA 94720
> Phone (510)666-3344
>
>

