[slurm-users] Enforce gpu usage limits (with GRES?)

Analabha Roy hariseldon99 at gmail.com
Wed Feb 1 17:12:40 UTC 2023


Hi,

I'm new to slurm, so I apologize in advance if my question seems basic.

I just purchased a single-node 'cluster' consisting of one 64-core CPU and
an NVIDIA RTX 5000 GPU (Turing architecture, I think). The vendor supplied it
with Ubuntu 20.04 and slurm-wlm 19.05.5. Now I'm trying to adjust the
config to suit the needs of my department.

I'm trying to bone up on GRES scheduling by reading this manual page
<https://slurm.schedmd.com/gres.html>, but am confused about some things.

My slurm.conf file has the following lines put in it by the vendor:

###################
# COMPUTE NODES
GresTypes=gpu
NodeName=shavak-DIT400TR-55L CPUs=64 SocketsPerBoard=2 CoresPerSocket=32
ThreadsPerCore=1 RealMemory=95311 Gres=gpu:1
#PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP

PartitionName=CPU Nodes=ALL Default=Yes MaxTime=INFINITE  State=UP

PartitionName=GPU Nodes=ALL Default=NO MaxTime=INFINITE  State=UP
#####################

So they created two partitions that are essentially identical. In addition,
they put just the following line in gres.conf:

###################
NodeName=shavak-DIT400TR-55L      Name=gpu        File=/dev/nvidia0
###################

That's all. However, this configuration does not appear to constrain anyone
in any manner. As a regular user, I can still use srun or sbatch to start
GPU jobs from the "CPU" partition, and nvidia-smi shows that a simple cupy
<https://cupy.dev/> script that multiplies matrices, submitted as an sbatch
job to the CPU partition, can access the GPU just fine. Note that the
environment variable CUDA_VISIBLE_DEVICES does not appear to be set in
any job step. I tested this by starting an interactive srun shell in both
the CPU and GPU partitions and running "echo $CUDA_VISIBLE_DEVICES", and
got bupkis for both (roughly as shown below).
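
For concreteness, the check was roughly this (the exact srun flags may have
differed slightly in my actual sessions):

###################
$ srun -p CPU --pty bash -i
$ echo $CUDA_VISIBLE_DEVICES
                               # (empty output)
$ exit
$ srun -p GPU --pty bash -i
$ echo $CUDA_VISIBLE_DEVICES
                               # (empty output)
$ exit
###################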


What I need to do is constrain jobs to chunks of the GPU's cores/memory so
that multiple jobs can share the GPU; something like the sketch below is
what I'm hoping users could eventually submit.
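
(The mps GRES name and the count of 25 here are just my guess from the gres
page linked above, and matmul_cupy.py stands in for the cupy script I
mentioned; I haven't configured any of this yet.)

###################
#!/bin/bash
#SBATCH --partition=GPU
#SBATCH --gres=mps:25     # hoping this means roughly a 25% share of the GPU
#SBATCH --cpus-per-task=4

python matmul_cupy.py
###################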

As I understand from the gres manpage, simply adding "AutoDetect=nvml"
(NVML should be installed with the NVIDIA HPC SDK, right? I installed it
with apt-get...) to gres.conf should let Slurm detect the GPU's
internal specifications automatically. Is that all, or do I need to
configure an mps GRES as well? Will that succeed in walling the GPU off
from jobs that don't request any GRES (perhaps by setting
CUDA_VISIBLE_DEVICES), or is there additional config needed for that? Do I
really need that extra "GPU" partition the vendor put in for any of this,
or is there a way to bind GRES resources to a particular partition in
such a way that simply launching jobs in that partition will be enough?
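
To make my guess concrete, I imagine the change would look roughly like the
following. The mps Count of 100 and the matching slurm.conf edits are just
my reading of the gres page, not something I've tested:

###################
# gres.conf (my guess)
AutoDetect=nvml
NodeName=shavak-DIT400TR-55L Name=gpu File=/dev/nvidia0
NodeName=shavak-DIT400TR-55L Name=mps Count=100 File=/dev/nvidia0

# slurm.conf (my guess)
GresTypes=gpu,mps
NodeName=shavak-DIT400TR-55L CPUs=64 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=1 RealMemory=95311 Gres=gpu:1,mps:100
###################

Please do correct the above if it's wrong or incomplete.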

Thanks for your attention.
Regards
AR

-- 
Analabha Roy
Assistant Professor
Department of Physics
<http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: daneel at utexas.edu, aroy at phys.buruniv.ac.in, hariseldon99 at gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/

