<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:DengXian;
        panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:"\@DengXian";
        panose-1:2 1 6 0 3 1 1 1 1 1;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0cm;
        text-align:justify;
        font-size:10.5pt;
        font-family:DengXian;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}
@page WordSection1
        {size:612.0pt 792.0pt;
        margin:72.0pt 90.0pt 72.0pt 90.0pt;}
div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-AU" link="#0563C1" vlink="#954F72" style="word-wrap:break-word;text-justify-trim:punctuation">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;mso-fareast-language:EN-US">If a job can see GPUs other than the ones allocated to it, then cgroups aren’t being used to constrain devices.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;mso-fareast-language:EN-US">See the Slurm cgroup documentation (https://slurm.schedmd.com/cgroup.conf.html).<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;mso-fareast-language:EN-US">With cgroups enabled, `nvidia-smi` shows only the GPUs allocated to the job.<o:p></o:p></span></p>
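<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri',sans-serif;mso-fareast-language:EN-US">As a rough sketch of how to check this on a node: the script below looks for the two settings that, as far as I know, have to be in place for device confinement, namely a TaskPlugin that includes task/cgroup in slurm.conf and ConstrainDevices=yes in cgroup.conf. The function names has_setting and check_gpu_confinement are made up for this example, and the file paths are passed as arguments because their location differs between sites. It only inspects the files; it does not prove that slurmd has actually loaded them.<o:p></o:p></span></p>
<pre>
```shell
#!/bin/bash
# Sketch only: report whether the Slurm config files ask for cgroup
# device confinement. Function names are hypothetical helpers.

# has_setting FILE REGEX
# Succeeds if an uncommented line in FILE starts with REGEX
# (case-insensitive, leading whitespace allowed).
has_setting() {
    grep -Eiq "^[[:space:]]*$2" "$1"
}

# check_gpu_confinement SLURM_CONF CGROUP_CONF
# Prints what is missing and returns non-zero if either setting is absent.
check_gpu_confinement() {
    local slurm_conf=$1 cgroup_conf=$2 status=0

    # slurm.conf must load the cgroup task plugin.
    if ! has_setting "$slurm_conf" 'TaskPlugin=.*task/cgroup'; then
        echo "slurm.conf: TaskPlugin does not include task/cgroup"
        status=1
    fi

    # cgroup.conf must ask for device constraints.
    if ! has_setting "$cgroup_conf" 'ConstrainDevices=yes'; then
        echo "cgroup.conf: ConstrainDevices=yes is not set"
        status=1
    fi

    if [ "$status" -eq 0 ]; then
        echo "cgroup device confinement is configured"
    fi
    return "$status"
}
```
</pre>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri',sans-serif;mso-fareast-language:EN-US">For example, assuming the files live under /etc/slurm as in the gres.conf path quoted below: check_gpu_confinement /etc/slurm/slurm.conf /etc/slurm/cgroup.conf<o:p></o:p></span></p>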
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;mso-fareast-language:EN-US">   -greg<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal" style="mso-margin-top-alt:0cm;margin-right:0cm;margin-bottom:12.0pt;margin-left:36.0pt">
<b><span style="font-size:12.0pt;color:black">From: </span></b><span style="font-size:12.0pt;color:black">slurm-users &lt;slurm-users-bounces@lists.schedmd.com&gt; on behalf of taleintervenor@sjtu.edu.cn &lt;taleintervenor@sjtu.edu.cn&gt;<br>
<b>Date: </b>Wednesday, 23 March 2022 at 5:50 pm<br>
<b>To: </b>slurm-users@lists.schedmd.com &lt;slurm-users@lists.schedmd.com&gt;<br>
<b>Subject: </b>[EXTERNAL] [slurm-users] how to locate the problem when slurm failed to restrict gpu usage of user jobs<o:p></o:p></span></p>
</div>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">Hi, all:<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN"><o:p> </o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">We found that Slurm jobs submitted with arguments such as
<b>--gres gpu:1 </b>are not restricted in their GPU usage: users can still see all GPU cards on the allocated nodes.<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">Our GPU node has 4 cards, and its gres.conf is:<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">> cat /etc/slurm/gres.conf<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia0 CPUs=0-15<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia1 CPUs=16-31<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia2 CPUs=32-47<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia3 CPUs=48-63<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN"><o:p> </o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">To test, we submitted a simple batch job like this:<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">#!/bin/bash<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">#SBATCH --job-name=test<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">#SBATCH --partition=a100<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">#SBATCH --nodes=1<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">#SBATCH --ntasks=6<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">#SBATCH --gres=gpu:1<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">#SBATCH --reservation="gpu test"<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">hostname<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">nvidia-smi<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">echo end<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN"><o:p> </o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">In the output file, nvidia-smi showed all 4 GPU cards, but we expected to see only the 1 allocated GPU card.<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN"><o:p> </o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">The official Slurm documentation says that it sets the
<b>CUDA_VISIBLE_DEVICES </b>environment variable to restrict the GPU cards available to the user, but we did not find this variable in the job environment.
 We could only confirm that it exists in the prolog script environment, by adding the debug command
</span><span lang="ZH-CN" style="mso-fareast-language:ZH-CN">“</span><span lang="EN-US" style="mso-fareast-language:ZH-CN">echo $CUDA_VISIBLE_DEVICES</span><span lang="ZH-CN" style="mso-fareast-language:ZH-CN">”</span><span lang="EN-US" style="mso-fareast-language:ZH-CN">
 to the Slurm prolog script.<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN"><o:p> </o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span lang="EN-US" style="mso-fareast-language:ZH-CN">So how does Slurm cooperate with the NVIDIA tools to make a job see only its allocated GPU card? What are the requirements on the NVIDIA GPU driver, the CUDA toolkit,
 or any other component for Slurm to restrict GPU usage correctly?<o:p></o:p></span></p>
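<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri',sans-serif;mso-fareast-language:EN-US">A small sketch of how the test job above could report what it sees, rather than relying on reading the full nvidia-smi table: it counts the devices listed by `nvidia-smi -L` and prints CUDA_VISIBLE_DEVICES next to the count. The helper names count_visible_gpus and report_gpu_view are invented for this example. Note that nvidia-smi does not honour CUDA_VISIBLE_DEVICES, so the count only drops to the allocated number when device access itself is restricted.<o:p></o:p></span></p>
<pre>
```shell
#!/bin/bash
# Sketch only: helper names are hypothetical.

# count_visible_gpus
# Reads `nvidia-smi -L` style output on stdin and prints the number of
# lines that describe a GPU (lines beginning with "GPU N:").
count_visible_gpus() {
    grep -Ec '^GPU [0-9]+:'
}

# report_gpu_view EXPECTED
# Reads the same listing on stdin, prints what the job sees, and
# succeeds only if the visible count equals EXPECTED.
report_gpu_view() {
    local expected=$1 seen
    seen=$(count_visible_gpus)
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
    echo "visible=$seen expected=$expected"
    [ "$seen" -eq "$expected" ]
}
```
</pre>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri',sans-serif;mso-fareast-language:EN-US">In the batch script, the nvidia-smi line could then become: nvidia-smi -L | report_gpu_view 1<o:p></o:p></span></p>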
</div>
</body>
</html>