<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>You should definitely upgrade; there have been significant
improvements in that area since 17.11.</p>
<p>You can label nodes as cloud nodes, and then merely updating a
node's state to 'power_down' will run your suspend script.</p>
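<p>For example, after upgrading, something like this (the node name is
taken from your message; I have not tested this against 17.11):</p>
<p><font face="monospace">scontrol update nodename=slurm4-compute9
state=power_down</font></p>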
<p>Brian Andrus<br>
</p>
<div class="moz-cite-prefix">On 7/30/2021 5:05 PM, Soichi Hayashi
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAGLeeFiPnyBuoTqzPNoqTJ7wQ46-+PNHr8ZfxVsCp2-mM8MZ+A@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">Brian,
<div><br>
</div>
<div>Yes, slurmd is not running on that node because the node
itself is not there anymore (the whole VM is gone!). When a
node is no longer in use, slurm automatically runs the
slurm_suspend.sh script, which removes the whole node (VM) by
running "openstack server delete $host". There is no
server/VM, no IP address, no DNS name, nothing.
"slurm4-compute9" only exists as a hypothetical node that can
be launched in the future if there are more jobs to run.
That's how the "cloud" partition works, right?</div>
<div><br>
</div>
<div><font face="monospace">[slurm.conf]</font></div>
<div><font face="monospace">SuspendProgram=/usr/local/sbin/slurm_suspend.sh<br>
SuspendTime=600 #time in seconds before an idle node is
suspended</font></div>
<div><br>
</div>
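<div>The script itself is essentially this (a sketch of the relevant
part; the openstack and scontrol lines are the ones the script
actually runs, the loop plumbing here is illustrative):</div>
<div><font face="monospace">#!/bin/bash<br>
# SuspendProgram receives the node list as its first argument<br>
for host in $(scontrol show hostnames "$1"); do<br>
&nbsp;&nbsp;openstack server delete "$host"<br>
&nbsp;&nbsp;scontrol update nodename="$host" nodeaddr="(null)"<br>
done</font></div>
<div><br>
</div>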
<div>I am wondering.. maybe something went wrong when slurm ran
slurm_suspend.sh, so slurm *thinks* that the node is still
there.. it tries to ping it, the ping fails
(obviously...), and the node gets marked DOWN?</div>
<div><br>
</div>
<div>I don't know if my theory is right or not.. but just to get
our cluster going again, is there a way to force slurm to
forget about a node that it "suspended" earlier? Is there a
command like "scontrol forcesuspend node=$id"?</div>
<div><br>
</div>
<div>Thank you for your help!<br>
</div>
<div><br>
</div>
<div>-soichi </div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Jul 30, 2021 at 7:56
PM Brian Andrus <<a href="mailto:toomuchit@gmail.com"
moz-do-not-send="true">toomuchit@gmail.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>That 'not responding' is the issue and usually means one of
two things:</p>
<p>1) slurmd is not running on the node<br>
2) something on the network is stopping the communication
between the node and the master (firewall, selinux,
congestion, bad nic, routes, etc.)</p>
<p>Brian Andrus<br>
</p>
<div>On 7/30/2021 3:51 PM, Soichi Hayashi wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>Brian,</div>
<div><br>
</div>
<div>Thank you for your reply, and thanks for fixing the
email subject. I forgot to edit it before I sent it!</div>
<div><br>
</div>
<div>I am not sure how to reply to your reply..
but I hope this makes it to the right place.</div>
<div><br>
</div>
<div>I've updated slurm.conf to increase the controller
debug level</div>
<div>> SlurmctldDebug=5</div>
<div><br>
</div>
<div>I now see additional log output (debug).</div>
<div><br>
</div>
<div><font face="monospace">[2021-07-30T22:42:05.255]
debug: Spawning ping agent for
slurm4-compute[2-6,10,12-14]<br>
[2021-07-30T22:42:05.256] error: Nodes
slurm4-compute[9,15,19-22,30] not responding,
setting DOWN</font><br>
</div>
<div><br>
</div>
<div>It's still very sparse, but it looks like slurm is
trying to ping nodes that have already been removed (they
don't exist anymore, as they were removed by the
slurm_suspend.sh script).</div>
<div><br>
</div>
<div>I tried sinfo -R, but it doesn't really give much
info:</div>
<div><br>
</div>
<div><font face="monospace">$ sinfo -R<br>
REASON USER TIMESTAMP
NODELIST<br>
Not responding slurm 2021-07-30T22:42:05
slurm4-compute[9,15,19-22,30]</font><br>
</div>
<div><br>
</div>
<div>These machines are gone, so they should not respond.</div>
<div><br>
</div>
<div><font face="monospace">$ ping slurm4-compute9<br>
ping: slurm4-compute9: Name or service not known</font><br>
</div>
<div><br>
</div>
<div>This is expected.</div>
<div><br>
</div>
<div>Why does slurm keep trying to contact a node
that's already been removed? slurm_suspend.sh does the
following to "remove" the node from the partition.</div>
<div><font face="monospace">> scontrol update
nodename=${host} nodeaddr="(null)"</font></div>
<div>Maybe this isn't the correct way to do it? Is there
a way to force slurm to forget about the node? I tried
"scontrol update node=$node state=idle", but this only
works for a few minutes, until slurm's ping agent kicks
in and marks them down again.</div>
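<div><br>
</div>
<div>(One thing I may try: if I am reading the elastic computing page
right, the suspend script is supposed to reset NodeAddr back to the
node's own name rather than "(null)", i.e. the line below. I have not
verified this yet.)</div>
<div><font face="monospace">scontrol update nodename=${host} nodeaddr=${host}</font></div>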
<div><br>
</div>
<div>Thanks!!</div>
<div>Soichi </div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Jul 30, 2021
at 2:21 PM Soichi Hayashi <<a
href="mailto:hayashis@iu.edu" target="_blank"
moz-do-not-send="true">hayashis@iu.edu</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px
0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div dir="ltr">Hello. I need a help with
troubleshooting our slurm cluster.
<div><br>
</div>
<div>I am running slurm-wlm 17.11.2 on Ubuntu 20
on a public cloud infrastructure (Jetstream)
using an elastic computing mechanism (<a
href="https://slurm.schedmd.com/elastic_computing.html"
target="_blank" moz-do-not-send="true">https://slurm.schedmd.com/elastic_computing.html</a>).
Our cluster works for the most part, but for
some reason, a few of our nodes constantly go
into the "down" state.
<div><br>
</div>
<div><font face="monospace">PARTITION AVAIL
TIMELIMIT JOB_SIZE ROOT OVERSUBS
GROUPS NODES STATE NODELIST<br>
cloud* up 2-00:00:00 1-infinite no
YES:4 all 10 idle~
slurm9-compute[1-5,10,12-15]<br>
cloud* up 2-00:00:00 1-infinite no
YES:4 all 5 down
slurm9-compute[6-9,11]</font><br>
<div><br>
</div>
<div>The only output I see in the slurm log is
this:</div>
<div><br>
</div>
<div><font face="monospace">[2021-07-30T15:10:55.889]
Invalid node state transition requested
for node slurm9-compute6 from=COMPLETING
to=RESUME<br>
[2021-07-30T15:21:37.339] Invalid node
state transition requested for node
slurm9-compute6 from=COMPLETING* to=RESUME<br>
[2021-07-30T15:27:30.039] update_node:
node slurm9-compute6 reason set to:
completing<br>
[2021-07-30T15:27:30.040] update_node:
node slurm9-compute6 state set to DOWN<br>
[2021-07-30T15:27:40.830] update_node:
node slurm9-compute6 state set to IDLE</font><br>
</div>
<div>..</div>
<div><font face="monospace">[2021-07-30T15:34:20.628] error: Nodes
slurm9-compute[6-9,11] not responding,
setting DOWN</font><br>
</div>
</div>
<div><br>
</div>
<div>With elastic computing, any unused nodes
are automatically removed
(by SuspendProgram=/usr/local/sbin/slurm_suspend.sh),
so nodes are *expected* not to respond once
they are removed. But they should not be
marked as DOWN; they should simply be set to
"idle".</div>
</div>
<div><br>
</div>
<div>To work around this issue, I am running the
following cron job.</div>
<div><br>
</div>
<div><font face="monospace">0 0 * * * scontrol
update node=slurm9-compute[1-30] state=resume</font><br>
</div>
<div><br>
</div>
<div>This "works" somewhat.. but our nodes go to
"DOWN" state so often that running this every
hour is not enough.</div>
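<div><br>
</div>
<div>A variant I am considering (an untested sketch; it
resets only the nodes that sinfo currently reports as
down, instead of the whole range):</div>
<div><font face="monospace">#!/bin/bash<br>
# reset any nodes currently marked down<br>
down=$(sinfo -h -t down -o '%N')<br>
[ -n "$down" ] && scontrol update nodename="$down" state=resume</font></div>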
<div><br>
</div>
<div>Here is the full content of our slurm.conf</div>
<div><br>
</div>
<div><font face="monospace">root@slurm9:~# cat
/etc/slurm-llnl/slurm.conf <br>
ClusterName=slurm9<br>
ControlMachine=slurm9<br>
<br>
SlurmUser=slurm<br>
SlurmdUser=root<br>
SlurmctldPort=6817<br>
SlurmdPort=6818<br>
AuthType=auth/munge<br>
StateSaveLocation=/tmp<br>
SlurmdSpoolDir=/tmp/slurmd<br>
SwitchType=switch/none<br>
MpiDefault=none<br>
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid<br>
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid<br>
ProctrackType=proctrack/pgid<br>
ReturnToService=1<br>
Prolog=/usr/local/sbin/slurm_prolog.sh<br>
<br>
#<br>
# TIMERS<br>
SlurmctldTimeout=300<br>
SlurmdTimeout=300<br>
#make slurm a little more tolerant here<br>
MessageTimeout=30<br>
TCPTimeout=15<br>
BatchStartTimeout=20<br>
GetEnvTimeout=20<br>
InactiveLimit=0<br>
MinJobAge=604800<br>
KillWait=30<br>
Waittime=0<br>
#<br>
# SCHEDULING<br>
SchedulerType=sched/backfill<br>
SelectType=select/cons_res<br>
SelectTypeParameters=CR_CPU_Memory<br>
#FastSchedule=0<br>
<br>
# LOGGING<br>
SlurmctldDebug=3<br>
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log<br>
SlurmdDebug=3<br>
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log<br>
JobCompType=jobcomp/none<br>
<br>
# ACCOUNTING<br>
JobAcctGatherType=jobacct_gather/linux<br>
JobAcctGatherFrequency=30<br>
<br>
AccountingStorageType=accounting_storage/filetxt<br>
AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log<br>
<br>
#CLOUD CONFIGURATION<br>
PrivateData=cloud<br>
ResumeProgram=/usr/local/sbin/slurm_resume.sh<br>
SuspendProgram=/usr/local/sbin/slurm_suspend.sh<br>
ResumeRate=1 #number of nodes per minute that
can be created; 0 means no limit<br>
ResumeTimeout=900 #max time in seconds between
ResumeProgram running and when the node is
ready for use<br>
SuspendRate=1 #number of nodes per minute that
can be suspended/destroyed<br>
SuspendTime=600 #time in seconds before an
idle node is suspended<br>
SuspendTimeout=300 #time between running
SuspendProgram and the node being completely
down<br>
TreeWidth=30<br>
<br>
NodeName=slurm9-compute[1-15] State=CLOUD
CPUs=24 RealMemory=60388<br>
PartitionName=cloud LLN=YES
Nodes=slurm9-compute[1-15] Default=YES
MaxTime=48:00:00 State=UP Shared=YES</font><br>
</div>
<div><font face="monospace"><br>
</font></div>
<div><font face="arial, sans-serif">I appreciate
your assistance!</font></div>
<div><font face="arial, sans-serif"><br>
</font></div>
<div><font face="arial, sans-serif">Soichi Hayashi</font></div>
<div><font face="arial, sans-serif">Indiana
University</font></div>
<div><font face="arial, sans-serif"><br>
</font></div>
<div><font face="monospace"><br>
</font></div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</body>
</html>