Hi everyone,

I'm in charge of the new GPU cluster in my lab.

I'm using cgroups to restrict access to resources, especially GPUs. This works fine when users go through the shell created by Slurm (srun/sbatch).

I'm using the pam_slurm_adopt.so module to give users SSH access to a node when they already have a job running on it. However, when connecting to the node through SSH, a user can see and use all of the node's GPUs, even if they only asked for one. This is a real problem, as most users work on the cluster by connecting their IDE to it over SSH.

I can't find any related resources online or in the list archives. Do you have any idea what I'm missing? I'm not an expert; I've only been working in system administration for five months...

Thanks in advance,

Guillaume


Notes:

I'm running Slurm 21.08.5.
The gateway (slurmctld) runs Ubuntu 22.04 and the nodes run Fedora 36.


Here is my slurm.conf:

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=gpucluster
SlurmctldHost=gpu-gw
#
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
PrologFlags=contain
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
KillWait=30
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
# PRIORITY
# Activate the multifactor priority plugin
PriorityType=priority/multifactor
# Reset usage after 1 week
PriorityUsageResetPeriod=WEEKLY
# Apply no decay
PriorityDecayHalfLife=0
# The smaller the job, the greater its job size priority.
PriorityFavorSmall=YES
# The job's age factor reaches 1.0 after waiting in the queue for a week.
PriorityMaxAge=7-0
# This next group determines the weighting of each of the
# components of the Multifactor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0 # don't use the QOS factor
#
#
#
# MEMORY MAX AND DEFAULT VALUES (MB)
DefCpuPerGPU=2
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTRES=gres/gpu
AccountingStoreFlags=job_comment
AccountingStoragePort=6819
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
# PREEMPTION
PreemptMode=Requeue
PreemptType=preempt/qos
#
# COMPUTE NODES
GresTypes=gpu
NodeName=node0[1-7] CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=386683 Gres=gpu:3 Features="nvidia,ampere,A100,pcie" State=UNKNOWN
NodeName=nodemm01 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=16000 Gres=gpu:4 Features="nvidia,GeForce,RTX3090" State=UNKNOWN
PartitionName=ids Nodes=node0[1-7] Default=YES MaxTime=INFINITE AllowQos=normal,default,preempt State=UP
PartitionName=mm Nodes=nodemm01 MaxTime=INFINITE State=UP
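
I haven't pasted my PAM stack verbatim, but pam_slurm_adopt is enabled the way the module's documentation describes, roughly like this (paraphrased from memory, not an exact copy of my files):

# /etc/pam.d/sshd (excerpt)
account    required    pam_slurm_adopt.so

# /etc/ssh/sshd_config (excerpt)
UsePAM yes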

Here is my cgroup.conf:

###
# Slurm cgroup support configuration file.
###
CgroupAutomount=yes
#CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=yes
#ConstrainKmemSpace=no # avoid known kernel issues
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
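
For what it's worth, the nodes also have a gres.conf along the usual NVIDIA lines; the device paths below are illustrative rather than copied from the actual file:

# gres.conf (sketch)
NodeName=node0[1-7] Name=gpu File=/dev/nvidia[0-2]
NodeName=nodemm01 Name=gpu File=/dev/nvidia[0-3]

And this is roughly how the symptom shows up (a sketch; GPU counts match the node definitions above):

$ srun --gres=gpu:1 --pty bash   # shell created by Slurm
$ nvidia-smi -L                  # lists only the one allocated GPU

$ ssh node01                     # shell adopted by pam_slurm_adopt
$ nvidia-smi -L                  # lists all 3 GPUs on the node
$ cat /proc/self/cgroup          # what I use to check which cgroup the SSH shell landed in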
--
Guillaume LECHANTRE
Research and Development Engineer
Télécom Paris
19 place Marguerite Perey, CS 20031, 91123 Palaiseau Cedex
https://www.telecom-paris.fr
A school of IMT (https://www.imt.fr)