[slurm-users] [EXT] GPU Jobs with Slurm

Ryan Novosielski novosirj at rutgers.edu
Thu Jan 14 23:18:26 UTC 2021


AFAIK, if you have this set up correctly, nvidia-smi will be restricted too, though I think we were seeing a bug there at one time in this version.
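
As a quick sanity check (illustrative only, assuming ConstrainDevices=yes is actually in effect and you're not hitting that bug), a single-GPU request should see exactly one device from inside the job, along the lines of:

srun --gres=gpu:1 -n 1 --partition=gpu nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla P100-PCIE-16GB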

--
#BlackLivesMatter
____
|| \\UTGERS,       |---------------------------*O*---------------------------
||_// the State     |         Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ     | Office of Advanced Research Computing - MSB C630, Newark
    `'

On Jan 14, 2021, at 18:05, Abhiram Chintangal <achintangal at berkeley.edu> wrote:


Sean,

Thanks for the clarification. I noticed that I was missing the "AllowedDevicesFile" option in mine. After adding it, the GPU allocations started working. (Slurm version 18.08.8)

I was also incorrectly using "nvidia-smi" as a check.
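
In case it's useful to others hitting this: a check that reflects what Slurm actually granted, even without device constraints, is the CUDA_VISIBLE_DEVICES variable it sets for the step. Something like:

srun --gres=gpu:p100:2 -n 1 --partition=gpu bash -c 'echo $CUDA_VISIBLE_DEVICES'

should print the indices of the two allocated GPUs (e.g. 0,1). The single quotes matter so the variable expands on the compute node rather than in the submitting shell.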

Regards,

Abhiram

On Thu, Jan 14, 2021 at 12:22 AM Sean Crosby <scrosby at unimelb.edu.au> wrote:
Hi Abhiram,

You need to configure cgroup.conf to constrain the devices a job has access to. See https://slurm.schedmd.com/cgroup.conf.html

My cgroup.conf is

CgroupAutomount=yes
AllowedDevicesFile="/usr/local/slurm/etc/cgroup_allowed_devices_file.conf"

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=yes

TaskAffinity=no

CgroupMountpoint=/sys/fs/cgroup

ConstrainDevices=yes is the key to stopping jobs from having access to GPUs they didn't request.
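
The allowed devices file is just the whitelist of devices every job can always access; the GPU device files are then granted or denied per allocation on top of that. A minimal sketch, along the lines of the example in the cgroup.conf man page (adjust paths to suit your nodes):

/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*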

Sean

--
Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



On Thu, 14 Jan 2021 at 18:36, Abhiram Chintangal <achintangal at berkeley.edu> wrote:

Hello,

I recently set up a small cluster at work using Warewulf/Slurm. Currently, I am not able to get the scheduler to
work well with GPUs (GRES).

While Slurm is able to filter by GPU type, it allocates all the GPUs on the node. See below:

[abhiram at whale ~]$ srun --gres=gpu:p100:2 -n 1 --partition=gpu nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla P100-PCIE-16GB
1, Tesla P100-PCIE-16GB
2, Tesla P100-PCIE-16GB
3, Tesla P100-PCIE-16GB
[abhiram at whale ~]$ srun --gres=gpu:titanrtx:2 -n 1 --partition=gpu nvidia-smi --query-gpu=index,name --format=csv
index, name
0, TITAN RTX
1, TITAN RTX
2, TITAN RTX
3, TITAN RTX
4, TITAN RTX
5, TITAN RTX
6, TITAN RTX
7, TITAN RTX

I am fairly new to Slurm and still figuring out my way around it. I would really appreciate any help with this.

For your reference, I attached the slurm.conf and gres.conf files.
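
For the general shape of the setup, a gres.conf and slurm.conf for nodes like these would typically contain something like the following (node names, device paths, and counts are illustrative only, not copied from the attached files):

# gres.conf (illustrative sketch)
NodeName=gpu01 Name=gpu Type=p100     File=/dev/nvidia[0-3]
NodeName=gpu02 Name=gpu Type=titanrtx File=/dev/nvidia[0-7]

# slurm.conf (illustrative sketch)
GresTypes=gpu
NodeName=gpu01 Gres=gpu:p100:4
NodeName=gpu02 Gres=gpu:titanrtx:8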

Best,

Abhiram

--

Abhiram Chintangal
QB3 Nogales Lab
Bioinformatics Specialist @ Howard Hughes Medical Institute
University of California Berkeley
708D Stanley Hall, Berkeley, CA 94720
Phone (510)666-3344

