[slurm-users] [EXT] GPU Jobs with Slurm

Fulcomer, Samuel samuel_fulcomer at brown.edu
Thu Jan 14 23:56:06 UTC 2021


Also note that there was a bug in an older version of SLURM
(pre-17-something) that corrupted the database in a way that prevented
GPU/gres fencing. If that affected you and you're still using the same
database, GPU fencing probably isn't working. There's a way of fixing this
manually through SQL hacking; however, we just went with a virgin database
when we last upgraded in order to get it working (and sucked the accounting
data into XDMoD).
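A quick way to sanity-check fencing from inside an allocation is to compare what SLURM_STEP_GPUS reports against what nvidia-smi can see. As noted later in this thread, SLURM_STEP_GPUS holds the physical device numbers (N, where the device is /dev/nvidiaN), even though nvidia-smi renumbers visible GPUs from 0 inside the cgroup. A minimal Python sketch of that parsing, with the environment passed in explicitly so it can be exercised off-cluster (the helper name is mine, not part of Slurm):

```python
import os

def step_gpu_devices(env=None):
    """Map SLURM_STEP_GPUS to the physical /dev/nvidiaN device paths.

    SLURM_STEP_GPUS is a comma-separated list of the actual device
    numbers allocated to the step, regardless of how nvidia-smi
    renumbers them inside the job's cgroup.
    """
    env = os.environ if env is None else env
    raw = env.get("SLURM_STEP_GPUS", "")
    if not raw:
        return []
    return ["/dev/nvidia%d" % int(n) for n in raw.split(",")]

# Example: a step that was allocated physical GPUs 2 and 3.
print(step_gpu_devices({"SLURM_STEP_GPUS": "2,3"}))
# → ['/dev/nvidia2', '/dev/nvidia3']
```

If fencing is working, the number of entries here should match the GPU count nvidia-smi reports inside the step; a mismatch suggests ConstrainDevices isn't taking effect.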



On Thu, Jan 14, 2021 at 6:36 PM Fulcomer, Samuel <samuel_fulcomer at brown.edu>
wrote:

> AllowedDevicesFile should not be necessary. The relevant devices are
> identified in gres.conf. "ConstrainDevices=yes" should be all that's needed.
>
> nvidia-smi will only see the allocated GPUs. Note that a single allocated
> GPU will always be shown by nvidia-smi to be GPU 0, regardless of its
> actual hardware ordinal, and GPU_DEVICE_ORDINAL will be set to 0. The value
> of SLURM_STEP_GPUS will be set to the actual device number (N, where the
> device is /dev/nvidiaN).
>
> On Thu, Jan 14, 2021 at 6:20 PM Ryan Novosielski <novosirj at rutgers.edu>
> wrote:
>
>> AFAIK, if you have this set up correctly, nvidia-smi will be restricted
>> too, though I think we were seeing a bug there at one time in this version.
>>
>> --
>> #BlackLivesMatter
>> ____
>> || \\UTGERS,     |---------------------------*O*---------------------------
>> ||_// the State  |         Ryan Novosielski - novosirj at rutgers.edu
>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>> ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>>     `'
>>
>> On Jan 14, 2021, at 18:05, Abhiram Chintangal <achintangal at berkeley.edu>
>> wrote:
>>
>> 
>> Sean,
>>
>> Thanks for the clarification. I noticed that I was missing the
>> "AllowedDevicesFile" option in mine. After adding it, the GPU allocations
>> started working. (Slurm version 18.08.8)
>>
>> I was also incorrectly using "nvidia-smi" as a check.
>>
>> Regards,
>>
>> Abhiram
>>
>> On Thu, Jan 14, 2021 at 12:22 AM Sean Crosby <scrosby at unimelb.edu.au>
>> wrote:
>>
>>> Hi Abhiram,
>>>
>>> You need to configure cgroup.conf to constrain the devices a job has
>>> access to. See https://slurm.schedmd.com/cgroup.conf.html
>>>
>>> My cgroup.conf is
>>>
>>> CgroupAutomount=yes
>>>
>>> AllowedDevicesFile="/usr/local/slurm/etc/cgroup_allowed_devices_file.conf"
>>>
>>> ConstrainCores=yes
>>> ConstrainRAMSpace=yes
>>> ConstrainSwapSpace=yes
>>> ConstrainDevices=yes
>>>
>>> TaskAffinity=no
>>>
>>> CgroupMountpoint=/sys/fs/cgroup
>>>
>>> The ConstrainDevices=yes is the key to stopping jobs from having access
>>> to GPUs they didn't request.
>>>
>>> Sean
>>>
>>> --
>>> Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
>>> Research Computing Services | Business Services
>>> The University of Melbourne, Victoria 3010 Australia
>>>
>>>
>>>
>>> On Thu, 14 Jan 2021 at 18:36, Abhiram Chintangal <
>>> achintangal at berkeley.edu> wrote:
>>>
>>>> Hello,
>>>>
>>>> I recently set up a small cluster at work using Warewulf/Slurm.
>>>> Currently, I am not able to get the scheduler to
>>>> work well with GPUs (GRES).
>>>>
>>>> While Slurm is able to filter by GPU type, it allocates all the GPUs
>>>> on the node. See below:
>>>>
>>>> [abhiram at whale ~]$ srun --gres=gpu:p100:2 -n 1 --partition=gpu
>>>>> nvidia-smi --query-gpu=index,name --format=csv
>>>>> index, name
>>>>> 0, Tesla P100-PCIE-16GB
>>>>> 1, Tesla P100-PCIE-16GB
>>>>> 2, Tesla P100-PCIE-16GB
>>>>> 3, Tesla P100-PCIE-16GB
>>>>> [abhiram at whale ~]$ srun --gres=gpu:titanrtx:2 -n 1 --partition=gpu
>>>>> nvidia-smi --query-gpu=index,name --format=csv
>>>>> index, name
>>>>> 0, TITAN RTX
>>>>> 1, TITAN RTX
>>>>> 2, TITAN RTX
>>>>> 3, TITAN RTX
>>>>> 4, TITAN RTX
>>>>> 5, TITAN RTX
>>>>> 6, TITAN RTX
>>>>> 7, TITAN RTX
>>>>>
>>>>
>>>> I am fairly new to Slurm and still figuring out my way around it. I
>>>> would really appreciate any help with this.
>>>>
>>>> For your reference, I attached the slurm.conf and gres.conf files.
>>>>
>>>> Best,
>>>>
>>>> Abhiram
>>>>
>>>> --
>>>>
>>>> Abhiram Chintangal
>>>> QB3 Nogales Lab
>>>> Bioinformatics Specialist @ Howard Hughes Medical Institute
>>>> University of California Berkeley
>>>> 708D Stanley Hall, Berkeley, CA 94720
>>>> Phone (510)666-3344
>>>>
>>>>
>>
>> --
>>
>> Abhiram Chintangal
>> QB3 Nogales Lab
>> Bioinformatics Specialist @ Howard Hughes Medical Institute
>> University of California Berkeley
>> 708D Stanley Hall, Berkeley, CA 94720
>> Phone (510)666-3344
>>
>>