[slurm-users] Cgroup not restricting GPU access with ssh
Guillaume Lechantre
guillaume.lechantre at telecom-paris.fr
Thu Feb 9 14:09:55 UTC 2023
Hi everyone,
I'm in charge of the new GPU cluster in my lab.
I'm using cgroups to restrict access to resources, especially GPUs.
This works fine when users stay within the connection created by Slurm.
I am using the pam_slurm_adopt.so module to give ssh access to a node if the user already has a job running on it.
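For reference, the line enabling it in the nodes' sshd PAM stack is essentially the standard one from the pam_slurm_adopt documentation (the exact file and options may differ on Fedora):

account    required    pam_slurm_adopt.so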
When connecting to the node through ssh, however, the user can see and use all the GPUs of the node, even if they asked for just one.
This is really problematic, as most users work on the cluster by connecting their IDE to it over ssh.
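To illustrate, roughly (the A100 nodes have 3 GPUs each):

  # inside the session created by Slurm, only the requested GPU is visible
  srun --gres=gpu:1 nvidia-smi -L     # lists 1 GPU

  # in an ssh session to the same node, adopted by pam_slurm_adopt
  ssh node01 nvidia-smi -L            # lists all 3 GPUs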
I can't find any related resources on the internet or in the old mails; do you have any idea what I am missing?
I'm not an expert, and I have only been working in system administration for 5 months...
Thanks in advance,
Guillaume
Notes:
I have Slurm 21.08.5
The gateway (slurmctld) is running on Ubuntu 22.04 and the nodes under Fedora 36.
Here is my slurm.conf:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=gpucluster
SlurmctldHost=gpu-gw
#
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
PrologFlags=contain
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
KillWait=30
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
# PRIORITY
# Activate the multi factor priority plugin
PriorityType=priority/multifactor
# Reset usage after 1 week
PriorityUsageResetPeriod=WEEKLY
#apply no decay
PriorityDecayHalfLife=0
# The smaller the job, the greater its job size priority.
PriorityFavorSmall=YES
# The job's age factor reaches 1.0 after waiting in the queue for a week.
PriorityMaxAge=7-0
# This next group determines the weighting of each of the
# components of the Multifactor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0 # don't use the qos factor
#
#
#
#MEMORY MAX AND DEFAULT VALUES (MB)
DefCpuPerGPU=2
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTRES=gres/gpu
AccountingStoreFlags=job_comment
AccountingStoragePort=6819
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
# PREEMPTION
PreemptMode=Requeue
PreemptType=preempt/qos
#
# COMPUTE NODES
GresTypes=gpu
NodeName=node0[1-7] CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=386683 Gres=gpu:3 Features="nvidia,ampere,A100,pcie" State=UNKNOWN
NodeName=nodemm01 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=16000 Gres=gpu:4 Features="nvidia,GeForce,RTX3090" State=UNKNOWN
PartitionName=ids Nodes=node0[1-7] Default=YES MaxTime=INFINITE AllowQos=normal,default,preempt State=UP
PartitionName=mm Nodes=nodemm01 MaxTime=INFINITE State=UP
Here is my cgroup.conf:
###
# Slurm cgroup support configuration file.
###
CgroupAutomount=yes
#CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=yes
#ConstrainKmemSpace=no #avoid known Kernel issues
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
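In case it helps to diagnose, here is roughly how I compare the two situations on a node (the cgroup paths below assume the cgroup v1 layout; on Fedora 36, which defaults to cgroup v2, they may look different):

  # inside an allocation created by Slurm
  srun --gres=gpu:1 --pty bash
  cat /proc/self/cgroup    # shows entries like .../slurm/uid_<uid>/job_<jobid>/...

  # inside an ssh session to the same node
  cat /proc/self/cgroup    # if this only shows the systemd user session
                           # (user.slice/.../session-*.scope), the sshd process
                           # was not moved into the job's device cgroup
  nvidia-smi               # and then all GPUs are visible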
Guillaume LECHANTRE
Research and Development Engineer
-
19 place Marguerite Perey
CS 20031
91123 Palaiseau Cedex