[slurm-users] [EXT] GPU Jobs with Slurm

Ryan Novosielski novosirj at rutgers.edu
Fri Jan 15 16:43:02 UTC 2021


Do you have any more information about that? I think that’s the bug I alluded to earlier in the conversation, and I believe I’m affected by it, but don’t know how to tell, how to fix it, or how to refer to it if I wanted to ask SchedMD (we have a contract).

--
#BlackLivesMatter
____
|| \\UTGERS,       |---------------------------*O*---------------------------
||_// the State     |         Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ     | Office of Advanced Research Computing - MSB C630, Newark
    `'

On Jan 14, 2021, at 18:56, Fulcomer, Samuel <samuel_fulcomer at brown.edu> wrote:


Also note that there was a bug in an older version of SLURM (pre-17-something) that corrupted the database in a way that prevented GPU/gres fencing. If that affected you and you're still using the same database, GPU fencing probably isn't working. There's a way of fixing this manually through SQL hacking; however, we just went with a virgin database when we last upgraded in order to get it working (and sucked the accounting data into XDMoD).



On Thu, Jan 14, 2021 at 6:36 PM Fulcomer, Samuel <samuel_fulcomer at brown.edu> wrote:
AllowedDevicesFile should not be necessary. The relevant devices are identified in gres.conf. "ConstrainDevices=yes" should be all that's needed.
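
For illustration, a minimal gres.conf along these lines should be enough (node names and device paths here are assumptions for the sketch, not taken from the attached files):

NodeName=gpu01 Name=gpu Type=p100 File=/dev/nvidia[0-3]
NodeName=gpu02 Name=gpu Type=titanrtx File=/dev/nvidia[0-7]

together with matching Gres=gpu:p100:4 and Gres=gpu:titanrtx:8 entries on the corresponding NodeName lines in slurm.conf.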

nvidia-smi will only see the allocated GPUs. Note that a single allocated GPU will always be shown by nvidia-smi to be GPU 0, regardless of its actual hardware ordinal, and GPU_DEVICE_ORDINAL will be set to 0. The value of SLURM_STEP_GPUS will be set to the actual device number (N, where the device is /dev/nvidiaN).
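
A quick way to see this from inside an allocation (the srun options mirror the earlier test; the variables are the ones mentioned above):

srun --gres=gpu:p100:2 -n 1 --partition=gpu bash -c 'echo SLURM_STEP_GPUS=$SLURM_STEP_GPUS GPU_DEVICE_ORDINAL=$GPU_DEVICE_ORDINAL; nvidia-smi -L'

With ConstrainDevices=yes, nvidia-smi -L should list only the two allocated cards, renumbered from 0, while SLURM_STEP_GPUS keeps the real device numbers.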

On Thu, Jan 14, 2021 at 6:20 PM Ryan Novosielski <novosirj at rutgers.edu> wrote:
AFAIK, if you have this set up correctly, nvidia-smi will be restricted too, though I think we were seeing a bug there at one time in this version.

--
#BlackLivesMatter
____
|| \\UTGERS,       |---------------------------*O*---------------------------
||_// the State     |         Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ     | Office of Advanced Research Computing - MSB C630, Newark
    `'

On Jan 14, 2021, at 18:05, Abhiram Chintangal <achintangal at berkeley.edu> wrote:


Sean,

Thanks for the clarification. I noticed that I was missing the "AllowedDevicesFile" option in my cgroup.conf. After adding it, the GPU allocations started working (Slurm version 18.08.8).

I was also incorrectly using "nvidia-smi" as a check.
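
As a sanity check, re-running my original test (same command as before):

srun --gres=gpu:p100:2 -n 1 --partition=gpu nvidia-smi --query-gpu=index,name --format=csv

With the device constraint in place this should list only the two requested P100s (shown as index 0 and 1) instead of all four.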

Regards,

Abhiram

On Thu, Jan 14, 2021 at 12:22 AM Sean Crosby <scrosby at unimelb.edu.au> wrote:
Hi Abhiram,

You need to configure cgroup.conf to constrain the devices a job has access to. See https://slurm.schedmd.com/cgroup.conf.html

My cgroup.conf is

CgroupAutomount=yes
AllowedDevicesFile="/usr/local/slurm/etc/cgroup_allowed_devices_file.conf"

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=yes

TaskAffinity=no

CgroupMountpoint=/sys/fs/cgroup

The ConstrainDevices=yes is the key to stopping jobs from having access to GPUs they didn't request.
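
For reference, the AllowedDevicesFile is just a whitelist of system devices that every job may access regardless of its GRES request. A typical file, along the lines of the example in the Slurm cgroup documentation (paths are illustrative; adjust to your hardware), looks like:

/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*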

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



On Thu, 14 Jan 2021 at 18:36, Abhiram Chintangal <achintangal at berkeley.edu> wrote:

Hello,

I recently set up a small cluster at work using Warewulf/Slurm. Currently, I am not able to get the scheduler to work well with GPUs (GRES).

While Slurm is able to filter by GPU type, it allocates all the GPUs on the node. See below:

[abhiram at whale ~]$ srun --gres=gpu:p100:2 -n 1 --partition=gpu nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla P100-PCIE-16GB
1, Tesla P100-PCIE-16GB
2, Tesla P100-PCIE-16GB
3, Tesla P100-PCIE-16GB
[abhiram at whale ~]$ srun --gres=gpu:titanrtx:2 -n 1 --partition=gpu nvidia-smi --query-gpu=index,name --format=csv
index, name
0, TITAN RTX
1, TITAN RTX
2, TITAN RTX
3, TITAN RTX
4, TITAN RTX
5, TITAN RTX
6, TITAN RTX
7, TITAN RTX

I am fairly new to Slurm and still figuring out my way around it. I would really appreciate any help with this.

For your reference, I attached the slurm.conf and gres.conf files.

Best,

Abhiram

--

Abhiram Chintangal
QB3 Nogales Lab
Bioinformatics Specialist @ Howard Hughes Medical Institute
University of California Berkeley
708D Stanley Hall, Berkeley, CA 94720
Phone (510)666-3344

