[slurm-users] [EXT] GPU Jobs with Slurm
Abhiram Chintangal
achintangal at berkeley.edu
Thu Jan 14 23:05:15 UTC 2021
Sean,
Thanks for the clarification. I noticed that I was missing the
"AllowedDevicesFile" option in mine. After adding it, the GPU allocations
started working (Slurm version 18.08.8).
I was also incorrectly using "nvidia-smi" as a check; nvidia-smi does not
honor CUDA_VISIBLE_DEVICES, so without device constraints it lists every GPU
on the node even when Slurm has allocated only a subset of them.
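For anyone searching the archives later: a minimal allowed-devices file, along
the lines of the one Sean's cgroup.conf points to, might look like the sketch
below (the path and entries are illustrative only; /dev/nvidia* is deliberately
left out, so GPU access is granted solely through a job's GRES allocation):

# /usr/local/slurm/etc/cgroup_allowed_devices_file.conf (illustrative)
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*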
Regards,
Abhiram
On Thu, Jan 14, 2021 at 12:22 AM Sean Crosby <scrosby at unimelb.edu.au> wrote:
> Hi Abhiram,
>
> You need to configure cgroup.conf to constrain the devices a job has
> access to. See https://slurm.schedmd.com/cgroup.conf.html
>
> My cgroup.conf is
>
> CgroupAutomount=yes
> AllowedDevicesFile="/usr/local/slurm/etc/cgroup_allowed_devices_file.conf"
>
> ConstrainCores=yes
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
> ConstrainDevices=yes
>
> TaskAffinity=no
>
> CgroupMountpoint=/sys/fs/cgroup
>
> The ConstrainDevices=yes is the key to stopping jobs from having access to
> GPUs they didn't request.
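>
> As a sanity check (reusing the GRES type and partition from your example;
> the output below is what I would expect to see, not an actual capture), the
> same srun request should now show only the two allocated devices:
>
> $ srun --gres=gpu:p100:2 -n 1 --partition=gpu \
>     nvidia-smi --query-gpu=index,name --format=csv
> index, name
> 0, Tesla P100-PCIE-16GB
> 1, Tesla P100-PCIE-16GB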
>
> Sean
>
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
>
> On Thu, 14 Jan 2021 at 18:36, Abhiram Chintangal <achintangal at berkeley.edu>
> wrote:
>
>> Hello,
>>
>> I recently set up a small cluster at work using Warewulf/Slurm.
>> Currently, I am not able to get the scheduler to work well with GPUs (GRES).
>>
>> While Slurm is able to filter by GPU type, it allocates all of the GPUs on
>> the node. See below:
>>
>> [abhiram at whale ~]$ srun --gres=gpu:p100:2 -n 1 --partition=gpu
>>> nvidia-smi --query-gpu=index,name --format=csv
>>> index, name
>>> 0, Tesla P100-PCIE-16GB
>>> 1, Tesla P100-PCIE-16GB
>>> 2, Tesla P100-PCIE-16GB
>>> 3, Tesla P100-PCIE-16GB
>>> [abhiram at whale ~]$ srun --gres=gpu:titanrtx:2 -n 1 --partition=gpu
>>> nvidia-smi --query-gpu=index,name --format=csv
>>> index, name
>>> 0, TITAN RTX
>>> 1, TITAN RTX
>>> 2, TITAN RTX
>>> 3, TITAN RTX
>>> 4, TITAN RTX
>>> 5, TITAN RTX
>>> 6, TITAN RTX
>>> 7, TITAN RTX
>>>
>>
>> I am fairly new to Slurm and still figuring out my way around it. I would
>> really appreciate any help with this.
>>
>> For your reference, I attached the slurm.conf and gres.conf files.
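>>
>> In case it helps, the GRES side of the setup is roughly along these lines
>> (node names and device paths below are placeholders, not the real values;
>> the attached files have the actual settings):
>>
>> # gres.conf (sketch)
>> NodeName=gpu-p100-01 Name=gpu Type=p100 File=/dev/nvidia[0-3]
>> NodeName=gpu-rtx-01 Name=gpu Type=titanrtx File=/dev/nvidia[0-7]
>>
>> # slurm.conf (sketch)
>> GresTypes=gpu
>> NodeName=gpu-p100-01 Gres=gpu:p100:4 ...
>> NodeName=gpu-rtx-01 Gres=gpu:titanrtx:8 ...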
>>
>> Best,
>>
>> Abhiram
>>
>> --
>>
>> Abhiram Chintangal
>> QB3 Nogales Lab
>> Bioinformatics Specialist @ Howard Hughes Medical Institute
>> University of California Berkeley
>> 708D Stanley Hall, Berkeley, CA 94720
>> Phone (510)666-3344
>>
>>
--
Abhiram Chintangal
QB3 Nogales Lab
Bioinformatics Specialist @ Howard Hughes Medical Institute
University of California Berkeley
708D Stanley Hall, Berkeley, CA 94720
Phone (510)666-3344