<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle18
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri",sans-serif;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal">You cannot have two default partitions. <o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">The slurm.conf is picking up the last of the entries flagged as Default; because the compute partition has no partition specified it is being sent to the default partition, thus the first srun is being submitted to the compute partition,
and that partition has only a node with no gpu GRES defined, thus the srun is correct to fail.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">What is the output of the command <o:p></o:p></p>
<p class="MsoNormal">sinfo<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">The partition understood to be the default will be indicated with * after the partition name.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b>From:</b> slurm-users <slurm-users-bounces@lists.schedmd.com>
<b>On Behalf Of </b>Durai Arasan<br>
<b>Sent:</b> Tuesday, May 12, 2020 10:47 AM<br>
<b>To:</b> slurm-users@lists.schedmd.com<br>
<b>Cc:</b> benjamin.glaessle@uni-tuebingen.de<br>
<b>Subject:</b> [slurm-users] slurm only looking in "default" partition during scheduling<o:p></o:p></p>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<p class="MsoNormal">Hi,<o:p></o:p></p>
<div>
<p class="MsoNormal">We have a cluster with 2 slave nodes. These are the slurm.conf lines describing nodes and partitions:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><i>NodeName=slurm-gpu-1 NodeAddr=192.168.0.200 Procs=16 Gres=gpu:2 State=UNKNOWN<br>
NodeName=slurm-gpu-2 NodeAddr=192.168.0.124 Procs=1 Gres=gpu:0 State=UNKNOWN<br>
PartitionName=gpu Nodes=slurm-gpu-1 Default=YES MaxTime=INFINITE State=UP<br>
PartitionName=compute Nodes=slurm-gpu-2 Default=YES MaxTime=INFINITE State=UP</i><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Running sinfo gives the following:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><i>PARTITION AVAIL TIMELIMIT NODES STATE NODELIST<br>
gpu up infinite 1 idle slurm-gpu-1<br>
compute* up infinite 1 idle slurm-gpu-2</i><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">When I request a gpu job to be run using the following command:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><i>srun --gres=gpu:2 nvidia-smi</i><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">I get the error:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><i>srun: error: Unable to allocate resources: Requested node configuration is not available</i><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">and in slurmctld.log these are the entries:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><i>[2020-05-12T14:33:47.578] _pick_best_nodes: JobId=55 never runnable in partition compute<br>
[2020-05-12T14:33:47.578] _slurm_rpc_allocate_resources: Requested node configuration is not available </i><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">It seems like slurm is looking only in the partition "compute" and not in the other partitions.<br>
Even if I explicitly specify the gpu node to srun it fails:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><i>srun --nodelist=slurm-gpu-1 nvidia-smi</i><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">I get the same error:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><i>srun: error: Unable to allocate resources: Requested node configuration is not available</i><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">and in slurmctld.log:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><i>[2020-05-12T14:38:57.242] No nodes satisfy requirements for JobId=56 in partition compute<br>
[2020-05-12T14:38:57.242] _slurm_rpc_allocate_resources: Requested node configuration is not available</i><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">It is still looking in partition "compute" even after specifying the node to srun. <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">But when I specify a partition, it works:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><i>srun -p gpu nvidia-smi</i><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">But I would not like to specify the partition and would like slurm to select nodes based on the options specified in the srun command. Does anyone understand what is wrong in the setup?<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Thanks,<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Durai<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal" style="margin-bottom:12.0pt"><o:p> </o:p></p>
</div>
</div>
</div>
</body>
</html>