<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:Wingdings;
panose-1:5 0 0 0 0 0 0 0 0 0;}
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Aptos;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
font-size:11.0pt;
font-family:"Aptos",sans-serif;
mso-fareast-language:EN-US;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:11.0pt;
font-family:"Aptos",sans-serif;
mso-ligatures:none;
mso-fareast-language:EN-US;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="FR" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal">Hello Experts,<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><span lang="EN-US">I’m a new Slurm user (so please bear with me
</span><span lang="EN-US" style="font-family:Wingdings">J</span><span lang="EN-US">).
<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">We recently deployed Slurm 23.11 on a very simple cluster, which consists of a master node (also acting as the login and slurmdbd node), a compute node with an NVIDIA HGX A100-SXM4-40GB GPU, detected as
4 GPUs: GPU [0-3], and a storage array presenting/sharing the NFS disk (where users’ home directories will be created as well).<o:p></o:p></span></p>
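<p class="MsoNormal"><span lang="EN-US">For reference, this is how the four devices can be listed on the compute node (commands only, output omitted; these are the device files the NVML autodetection and gres.conf refer to):</span></p>

```shell
# On the compute node: list the GPUs Slurm's NVML autodetection should see
nvidia-smi -L          # should print four entries, GPU 0 .. GPU 3
ls /dev/nvidia[0-3]    # the device files referenced in gres.conf
```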
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">The problem is that I’ve never been able to run a simple/dummy batch script in parallel across the 4 GPUs. In fact, running the same command “sbatch gpu-job.sh” multiple times shows that only a single job is running,
while the other jobs stay in a pending state:<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">Submitted batch job 214<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">Submitted batch job 215<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">Submitted batch job 216<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">[slurmtest@c-a100-master test-batch-scripts]$ squeue<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"> 216 gpu gpu-job slurmtest<b><span style="color:red"> PD</span></b> 0:00 1 (None)<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"> 215 gpu gpu-job slurmtest<b><span style="color:red"> PD</span></b> 0:00 1 (Priority)<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"> 214 gpu gpu-job slurmtest
<b><span style="color:red">PD</span></b> 0:00 1 (Priority)<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"> 213 gpu gpu-job slurmtest<b><span style="color:red"> PD</span></b> 0:00 1 (Priority)<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"> 212 gpu gpu-job slurmtest<b><span style="color:red"> PD</span></b> 0:00 1 (Resources)<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"> 211 gpu gpu-job slurmtest<b><span style="color:#00B050"> R
</span></b><span style="color:#00B050"> </span>0:14 1 c-a100-cn01<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><b><u><span lang="EN-US">PS</span></u></b><span lang="EN-US">: CPU jobs (i.e. using the default debug partition, without requesting the GPU GRES) do run in parallel. The issue with running parallel jobs is only seen when the GPUs are requested as a
GRES.<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">I’ve tried many combinations of settings in gres.conf and slurm.conf; many (if not most) of these combinations resulted in error messages in the slurmctld and slurmd logs.<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">The current gres.conf and slurm.conf contents are shown below. Even though this configuration gives no errors when restarting the slurmctld and slurmd services (on the master and compute nodes, respectively), as I said it doesn’t allow jobs
to be executed in parallel. The batch script contents are shared below as well, to give more clarity on what I’m trying to do:<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">[root@c-a100-master slurm]# <b>cat gres.conf | grep -v "^#"</b><o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">NodeName=c-a100-cn01 AutoDetect=nvml Name=gpu Type=A100 File=/dev/nvidia[0-3]<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">[root@c-a100-master slurm]# <b>cat slurm.conf | grep -v "^#" | egrep -i "AccountingStorageTRES|GresTypes|NodeName|partition"</b><o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">GresTypes=gpu<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">AccountingStorageTRES=gres/gpu<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">NodeName=c-a100-cn01 Gres=gpu:A100:4 CPUs=64 Boards=1 SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=515181 State=UNKNOWN<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">PartitionName=gpu Nodes=ALL MaxTime=10:0:0<o:p></o:p></span></p>
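<p class="MsoNormal"><span lang="EN-US">In case it helps, these are the commands I can run to double-check what slurmctld has actually registered for the node and partition (sketch only, no output included here):</span></p>

```shell
# On the master node: confirm the GRES and TRES slurmctld registered for the compute node
scontrol show node c-a100-cn01 | grep -i -E "gres|tres"
# And the gpu partition definition as seen by the controller
scontrol show partition gpu
```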
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">[slurmtest@c-a100-master test-batch-scripts]$
<b>cat gpu-job.sh</b><o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">#!/bin/bash<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">#SBATCH --job-name=gpu-job<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">#SBATCH --partition=gpu<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">#SBATCH --nodes=1<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">#SBATCH --gpus-per-node=4<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">#SBATCH --gres=gpu:4<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">#SBATCH --tasks-per-node=1<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">#SBATCH --output=gpu_job_output.%j # Output file name (replaces %j with job ID)<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">#SBATCH --error=gpu_job_error.%j # Error file name (replaces %j with job ID)<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">hostname<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">date<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">sleep 40<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">pwd<o:p></o:p></span></p>
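<p class="MsoNormal"><span lang="EN-US">One thing I’m wondering about: since each job requests all 4 GPUs (both --gpus-per-node=4 and --gres=gpu:4), could that alone explain why only one job runs at a time? If so, I assume a variant requesting a single GPU per job, like the sketch below, should let up to four copies run concurrently; please correct me if that assumption is wrong:</span></p>

```shell
#!/bin/bash
#SBATCH --job-name=gpu-job
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:1                 # one GPU per job instead of all four
#SBATCH --tasks-per-node=1
#SBATCH --output=gpu_job_output.%j   # %j is replaced with the job ID
#SBATCH --error=gpu_job_error.%j

hostname
date
sleep 40
pwd
```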
<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:FR"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:FR"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:FR">Any help on which changes need to be made to the config files (mainly slurm.conf and gres.conf) and/or the batch script, so that multiple jobs can be in the “Running” state at the same time
(in parallel)?<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:FR"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:FR">Thanks in advance for your help!<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:FR"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:FR"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:10.0pt;font-family:"Arial",sans-serif;color:black;mso-ligatures:standardcontextual;mso-fareast-language:FR">Best regards,<o:p></o:p></span></p>
<p class="MsoNormal" style="page-break-after:avoid"><b><span lang="EN-US" style="font-size:10.0pt;font-family:"Arial",sans-serif;color:black"><o:p> </o:p></span></b></p>
<p class="MsoNormal" style="page-break-after:avoid"><b><span lang="EN-US" style="font-size:10.0pt;font-family:"Arial",sans-serif;color:black">Hafedh Kherfani
<o:p></o:p></span></b></p>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</body>
</html>