<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman",serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
span.EmailStyle17
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:#1F497D;}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri",sans-serif;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Thanks, that was very useful. The key takeaways for me are:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Set “PrivateData=cloud”. Documentation states that the default is that everything’s public and that those options make things private. Apparently except for this
case which allows regular uses to see nodes that are powered down.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Set “ReturnToService=2”. One of the problems I’m having is nodes taking too long to boot up due to initialization I was doing. This option will bring the nodes
up even after the resume timeout causes them to be marked down.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Use strigger to take a custom action. I will use this to alarm when nodes go down, etc.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Thanks for the help. I think it will solve the issues I’m having.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><b><span style="font-size:11.0pt;font-family:"Calibri",sans-serif">From:</span></b><span style="font-size:11.0pt;font-family:"Calibri",sans-serif"> Kirill 'kkm' Katsnelson [mailto:kkm@pobox.com]
<br>
<b>Sent:</b> Friday, February 28, 2020 5:56 AM<br>
<b>To:</b> Slurm User Community List <slurm-users@lists.schedmd.com><br>
<b>Cc:</b> Carter, Allan <cartalla@amazon.com><br>
<b>Subject:</b> Re: [slurm-users] How to show state of CLOUD nodes<o:p></o:p></span></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal"><span style="font-family:"Courier New"">I'm running clusters entirely in Google Cloud. I'm not sure I'm understanding the issue--do the nodes disappear from view entirely only when they fail to power up by ResumeTimeout? Failures of this
kind are happening in GCE when resources are momentarily unavailable, but the nodes are still there, only shown as DOWN. FWIW, I'm currently using 19.05.4-1.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Courier New"">I have a trigger on the controller to catch and return these nodes back to POWER_SAVE. The offset of 20s lets all moving parts to settle; in any case, Slurm batches trigger runs internally, on a 15s
schedule IIRC, so it's not precise. --flags=PERM makes the trigger permanent, so you need to install it once and for all:</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Courier New"">strigger --set --down --flags=PERM --offset=20 --program=$script</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Courier New"">and the $script points to the full path (on the controller) of the following script. I'm copying the Log function and the _logname gymnastics from a file which is dot-sourced by the main program in
my setup, as it's part of a larger set of scripts; it's more complex than it has to be for your case, but I did not want to introduce a bug by hastily paring it down. You'll do that if you want.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Courier New""> ----8<--------8<--------8<---- </span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Courier New"">#!/bin/bash<br>
<br>
set -u<br>
<br>
# Tidy up name for logging: '.../slurm_resume.sh' => 'slurm-resume'<br>
_logname=$(basename "$0")<br>
_logname=${_logname%%.*}<br>
_logname=${_logname//_/-}<br>
<br>
Log() {<br>
local level=$1; shift;<br>
[[ $level == *.* ]] || level=daemon.$level # So we can use e.g. auth.notice.<br>
logger -p $level -t $_logname -- "$@"<br>
}<br>
<br>
reason=recovery<br>
<br>
for n; do<br>
Log notice "Recovering failed node(s) '$n'"<br>
scontrol update nodename="$n" reason="$reason" state=DRAIN &&<br>
scontrol update nodename="$n" reason="$reason" state=POWER_DOWN ||<br>
Log alert "The command 'scontrol update nodename=$n' failed." \<br>
"Is scontrol on PATH?"<br>
done<br>
<br>
exit 0</span><o:p></o:p></p>
</div>
<div>
<div>
<p class="MsoNormal"><span style="font-family:"Courier New""> ----8<--------8<--------8<----
</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Courier New"">The sequence of DRAIN first then POWER_DOWN is a magic left over from v18; see if POWER_DOWN alone does the trick. Or don't, as long as it works :)</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Courier New"">Also make sure you have (some of) the following in slurm.conf, assuming EC2 provides DNS name resolution--GCE does.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Courier New""># Important for cloud: do not assume the nodes will retain their IP<br>
# addresses, and do not cache name-to-IP mapping.<br>
CommunicationParameters=NoAddrCache<br>
SlurmctldParameters=cloud_dns,idle_on_node_suspend<br>
PrivateData=cloud # Always show cloud nodes.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Courier New"">ReturnToService=2 # When a DOWN node boots, it becomes available.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Courier New"">Hope this might help.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Courier New""> -kkm</span><o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal">On Thu, Feb 27, 2020 at 4:11 PM Carter, Allan <<a href="mailto:cartalla@amazon.com">cartalla@amazon.com</a>> wrote:<o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">
<div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">I’m setting up an EC2 SLURM cluster and when an instance doesn’t resume fast enough I get an error like:<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">node c7-c5-24xl-464 not resumed by ResumeTimeout(600) - marking down and power_save<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">I keep running into issues where my cloud nodes do not show up in sinfo and I can’t display their information with scontrol. This makes it difficult to know which of my CLOUD nodes
are available for scheduling and which are down for some reason and can’t be used. I haven’t figured out when slurm will show a cloud node and when it won’t and this make it pretty hard to manage the cluster.<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Would I be better off just removing the CLOUD attribute on my EC2 nodes? What is the advantage of making them CLOUD nodes if it just make it more difficult to manage the cluster?<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
</div>
</div>
</blockquote>
</div>
</div>
</body>
</html>