<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
</head>
<body>
<p>How are you taking them offline? I would expect a SuspendProgram
script to be running the command that shuts them down. Also, one
of your SlurmctldParameters should be "idle_on_node_suspend".</p>
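<p>As a rough sketch of what that could look like in slurm.conf (the
script paths and timing values here are placeholders I made up for
illustration, not taken from your setup, so adjust them for your
site):</p>
<pre>
# slurm.conf fragment (illustrative only; paths and values are assumptions)

# Hypothetical site scripts that terminate / launch the cloud instances
SuspendProgram=/usr/local/sbin/cloud_suspend.sh
ResumeProgram=/usr/local/sbin/cloud_resume.sh

# Example values: idle seconds before suspend, and how long to wait
# for a node to finish shutting down or booting
SuspendTime=600
SuspendTimeout=120
ResumeTimeout=600

# Mark nodes idle when they are suspended so they can be resumed later
SlurmctldParameters=idle_on_node_suspend
</pre>
<p>slurmctld needs to be restarted or reconfigured after a change like
this before it takes effect.</p>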
<p>Brian Andrus<br>
</p>
<div class="moz-cite-prefix">On 4/1/2021 12:25 PM, Sajesh Singh
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:MN2PR14MB2911AFA58091EFF449122F2CAC7B9@MN2PR14MB2911.namprd14.prod.outlook.com">
<style>@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}span.EmailStyle20
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}div.WordSection1
{page:WordSection1;}</style>
<div class="WordSection1">
<p class="MsoNormal">Brian,<o:p></o:p></p>
<p class="MsoNormal">The job is targeting the correct partition, and no
QOS limits are imposed that would cause this issue. The only remedy I
have found is to completely remove the cloud nodes from
Slurm, restart slurmctld, re-add the nodes to Slurm, and restart
slurmctld again.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">I believe the issue arises when the
nodes in the cloud go offline and slurmctld is no longer able
to reach them. I am not able to change the node state manually
so that slurmctld will allow a node to be used the next time a
job requires it. I cannot set the state to CLOUD.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">The other option may be to bring up all of
the nodes that are in this unknown state so that slurmctld can
go through the motions with them and then run the job again.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">-Sajesh-<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b>From:</b> slurm-users
<a class="moz-txt-link-rfc2396E" href="mailto:slurm-users-bounces@lists.schedmd.com">&lt;slurm-users-bounces@lists.schedmd.com&gt;</a>
<b>On Behalf Of </b>Brian Andrus<br>
<b>Sent:</b> Thursday, April 1, 2021 2:51 PM<br>
<b>To:</b> <a class="moz-txt-link-abbreviated" href="mailto:slurm-users@lists.schedmd.com">slurm-users@lists.schedmd.com</a><br>
<b>Subject:</b> Re: [slurm-users] Limit on number of nodes
user able to request<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div style="border:solid #9C6500 1.0pt;padding:2.0pt 2.0pt 2.0pt
2.0pt">
<p class="MsoNormal"
style="line-height:12.0pt;background:#FFEB9C"><b><span
style="font-size:10.0pt;color:black">EXTERNAL SENDER</span></b><span
style="font-size:10.0pt;color:black"><o:p></o:p></span></p>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<p>For this one, you want to look closely at the job. Is it
targeting a specific partition/nodelist?<o:p></o:p></p>
<p>See what resources it is looking for (scontrol show job
&lt;jobid&gt;)<br>
Also look at the partition limits as well as any QOS items
(if you are using them).<o:p></o:p></p>
<p>Brian Andrus<o:p></o:p></p>
<div>
<p class="MsoNormal">On 4/1/2021 10:00 AM, Sajesh Singh
wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">Some additional information after
enabling debug3 on slurmctld daemon:<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Logs show that there are enough usable
nodes for the job:<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-11<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-12<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-13<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-14<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-15<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-16<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-17<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-18<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-19<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-20<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-21<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-22<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-23<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-24<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-25<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-26<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-27<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-28<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-29<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-30<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-31<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-32<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-33<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-34<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-35<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-36<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-37<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-38<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-39<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-40<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-41<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-42<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-43<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-44<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-45<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-46<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-47<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-48<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-49<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-50<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-51<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-52<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-53<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-54<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-55<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-56<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-57<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-58<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-59<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-60<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-61<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-62<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-63<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-64<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-65<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-66<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-67<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-68<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-69<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-70<o:p></o:p></p>
<p class="MsoNormal">[2021-04-01T10:39:14.400] debug2: found
1 usable nodes from config containing node-71<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">But then the following line is in the
log as well:<o:p></o:p></p>
<p class="MsoNormal">debug3: select_nodes: JobId=67171529
required nodes not avail<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<p class="MsoNormal">--<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">-Sajesh-<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b>From:</b> slurm-users <a
href="mailto:slurm-users-bounces@lists.schedmd.com"
moz-do-not-send="true">
&lt;slurm-users-bounces@lists.schedmd.com&gt;</a> <b>On
Behalf Of </b>Sajesh Singh<br>
<b>Sent:</b> Thursday, March 25, 2021 9:02 AM<br>
<b>To:</b> Slurm User Community List <a
href="mailto:slurm-users@lists.schedmd.com"
moz-do-not-send="true">
&lt;slurm-users@lists.schedmd.com&gt;</a><br>
<b>Subject:</b> Re: [slurm-users] Limit on number of
nodes user able to request<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<p class="MsoNormal">No nodes in downed or drained state.
These are nodes that are dynamically brought up and down
via the powersave plugin. When they are taken offline due
to being idle I believe the state is set to FUTURE by
the powersave plugin.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<p class="MsoNormal">-Sajesh-<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b>From:</b> slurm-users &lt;<a
href="mailto:slurm-users-bounces@lists.schedmd.com"
moz-do-not-send="true">slurm-users-bounces@lists.schedmd.com</a>&gt;
<b>On Behalf Of </b>Brian Andrus<br>
<b>Sent:</b> Wednesday, March 24, 2021 11:02 PM<br>
<b>To:</b> <a
href="mailto:slurm-users@lists.schedmd.com"
moz-do-not-send="true">slurm-users@lists.schedmd.com</a><br>
<b>Subject:</b> Re: [slurm-users] Limit on number of
nodes user able to request<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<p>Do 'sinfo -R' and see if you have any down or drained
nodes.<o:p></o:p></p>
<p>Brian Andrus<o:p></o:p></p>
<div>
<p class="MsoNormal">On 3/24/2021 6:31 PM, Sajesh
Singh wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">Slurm 20.02<o:p></o:p></p>
<p class="MsoNormal">CentOS 8<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">I just recently noticed a strange
behavior when using the powersave plugin for
bursting to AWS. I have a queue configured with 60
nodes, but if I submit a job to use all of the nodes
I get the error:<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">(Nodes required for job are DOWN,
DRAINED or reserved for jobs in higher priority
partitions)<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">If I lower the job to request 50
nodes it gets submitted and runs with no problems. I
do not have any associations or QOS limits in place
that would limit the user. Any ideas as to what
could be causing this limit of 50 nodes to be
imposed?<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">-Sajesh-<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
</blockquote>
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</body>
</html>