<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Soichi,</p>
<p>(I added a subject)</p>
<p>You want to run 'sinfo -R' to find out the reason they are going
down.</p>
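<p>For example (node name taken from your output below; adjust as
needed):</p>
<p><font face="monospace"># show the recorded Reason for every down/drained node<br>
sinfo -R<br>
# full state of one node, including Reason and timestamps<br>
scontrol show node slurm9-compute6</font></p>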
<p>You may also want to bump up your logging to get more detail. It
looks like they are getting stuck in a 'completing' state. You
will want to look at the slurmd logs on the nodes themselves to
see why.</p>
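<p>A rough sketch (the log path is the one from your slurm.conf;
recent releases accept either numeric or named debug levels):</p>
<p><font face="monospace"># in slurm.conf on the compute nodes, then restart slurmd:<br>
SlurmdDebug=debug2<br>
<br>
# on the controller, raise slurmctld verbosity at runtime:<br>
scontrol setdebug debug2<br>
<br>
# then watch a node's own log while its state changes:<br>
tail -f /var/log/slurm-llnl/slurmd.log</font></p>
<p>Remember to drop the levels back down ('scontrol setdebug info')
once you have what you need, or the logs will grow quickly.</p>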
<div class="moz-cite-prefix">Brian Andrus</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">On 7/30/2021 11:21 AM, Soichi Hayashi
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAGLeeFhYS2BETBVP2UD8KrhKtKSBxTAQzjUpS_sy8ggObHtqYw@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">Hello. I need help with troubleshooting our Slurm
cluster.
<div><br>
</div>
<div>I am running slurm-wlm 17.11.2 on Ubuntu 20 on a public
cloud infrastructure (Jetstream) using an elastic computing
mechanism (<a
href="https://slurm.schedmd.com/elastic_computing.html"
moz-do-not-send="true">https://slurm.schedmd.com/elastic_computing.html</a>).
Our cluster works for the most part, but for some reason, a
few of our nodes constantly go into the "down" state.
<div><br>
</div>
<div><font face="monospace">PARTITION AVAIL TIMELIMIT
JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE
NODELIST<br>
cloud* up 2-00:00:00 1-infinite no YES:4
all 10 idle~ slurm9-compute[1-5,10,12-15]<br>
cloud* up 2-00:00:00 1-infinite no YES:4
all 5 down slurm9-compute[6-9,11]</font><br>
<div><br>
</div>
<div>The only log entries I see in the slurmctld log are these:</div>
<div><br>
</div>
<div><font face="monospace">[2021-07-30T15:10:55.889]
Invalid node state transition requested for node
slurm9-compute6 from=COMPLETING to=RESUME<br>
[2021-07-30T15:21:37.339] Invalid node state transition
requested for node slurm9-compute6 from=COMPLETING*
to=RESUME<br>
[2021-07-30T15:27:30.039] update_node: node
slurm9-compute6 reason set to: completing<br>
[2021-07-30T15:27:30.040] update_node: node
slurm9-compute6 state set to DOWN<br>
[2021-07-30T15:27:40.830] update_node: node
slurm9-compute6 state set to IDLE</font><br>
</div>
<div>..</div>
<div>[2021-07-30T15:34:20.628] error: Nodes
slurm9-compute[6-9,11] not responding, setting DOWN<br>
</div>
</div>
<div><br>
</div>
<div>With elastic computing, any unused nodes are
automatically removed
(by SuspendProgram=/usr/local/sbin/slurm_suspend.sh). So
nodes are *expected* to not respond once they are removed,
but they should not be marked as DOWN. They should simply be
set to "idle". </div>
</div>
<div><br>
</div>
<div>To work around this issue, I am running the following cron
job.</div>
<div><br>
</div>
<div><font face="monospace">0 0 * * * scontrol update
node=slurm9-compute[1-30] state=resume</font><br>
</div>
<div><br>
</div>
<div>This "works" somewhat, but our nodes go to the "DOWN" state so
often that running this every hour is not enough.</div>
<div><br>
</div>
<div>Here is the full content of our slurm.conf</div>
<div><br>
</div>
<div><font face="monospace">root@slurm9:~# cat
/etc/slurm-llnl/slurm.conf <br>
ClusterName=slurm9<br>
ControlMachine=slurm9<br>
<br>
SlurmUser=slurm<br>
SlurmdUser=root<br>
SlurmctldPort=6817<br>
SlurmdPort=6818<br>
AuthType=auth/munge<br>
StateSaveLocation=/tmp<br>
SlurmdSpoolDir=/tmp/slurmd<br>
SwitchType=switch/none<br>
MpiDefault=none<br>
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid<br>
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid<br>
ProctrackType=proctrack/pgid<br>
ReturnToService=1<br>
Prolog=/usr/local/sbin/slurm_prolog.sh<br>
<br>
#<br>
# TIMERS<br>
SlurmctldTimeout=300<br>
SlurmdTimeout=300<br>
#make slurm a little more tolerant here<br>
MessageTimeout=30<br>
TCPTimeout=15<br>
BatchStartTimeout=20<br>
GetEnvTimeout=20<br>
InactiveLimit=0<br>
MinJobAge=604800<br>
KillWait=30<br>
Waittime=0<br>
#<br>
# SCHEDULING<br>
SchedulerType=sched/backfill<br>
SelectType=select/cons_res<br>
SelectTypeParameters=CR_CPU_Memory<br>
#FastSchedule=0<br>
<br>
# LOGGING<br>
SlurmctldDebug=3<br>
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log<br>
SlurmdDebug=3<br>
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log<br>
JobCompType=jobcomp/none<br>
<br>
# ACCOUNTING<br>
JobAcctGatherType=jobacct_gather/linux<br>
JobAcctGatherFrequency=30<br>
<br>
AccountingStorageType=accounting_storage/filetxt<br>
AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log<br>
<br>
#CLOUD CONFIGURATION<br>
PrivateData=cloud<br>
ResumeProgram=/usr/local/sbin/slurm_resume.sh<br>
SuspendProgram=/usr/local/sbin/slurm_suspend.sh<br>
ResumeRate=1 #number of nodes per minute that can be
created; 0 means no limit<br>
ResumeTimeout=900 #max time in seconds between ResumeProgram
running and when the node is ready for use<br>
SuspendRate=1 #number of nodes per minute that can be
suspended/destroyed<br>
SuspendTime=600 #time in seconds before an idle node is
suspended<br>
SuspendTimeout=300 #time between running SuspendProgram and
the node being completely down<br>
TreeWidth=30<br>
<br>
NodeName=slurm9-compute[1-15] State=CLOUD CPUs=24
RealMemory=60388<br>
PartitionName=cloud LLN=YES Nodes=slurm9-compute[1-15]
Default=YES MaxTime=48:00:00 State=UP Shared=YES</font><br>
</div>
<div><font face="monospace"><br>
</font></div>
<div><font face="arial, sans-serif">I appreciate your
assistance!</font></div>
<div><font face="arial, sans-serif"><br>
</font></div>
<div><font face="arial, sans-serif">Soichi Hayashi</font></div>
<div><font face="arial, sans-serif">Indiana University</font></div>
<div><font face="arial, sans-serif"><br>
</font></div>
<div><font face="monospace"><br>
</font></div>
</div>
</blockquote>
</body>
</html>