<div dir="ltr"><div>Brian,</div><div><br></div><div>Thank you for your reply and thanks for setting the email title. I forgot to edit it before I sent it!</div><div><br></div><div>I am not sure how I can reply to your reply, but I hope this makes it to the right place.</div><div><br></div><div>I've updated slurm.conf to increase the controller debug level:</div><div>> SlurmctldDebug=5</div><div><br></div><div>I now see additional log output (debug).</div><div><br></div><div><font face="monospace">[2021-07-30T22:42:05.255] debug:  Spawning ping agent for slurm4-compute[2-6,10,12-14]<br>[2021-07-30T22:42:05.256] error: Nodes slurm4-compute[9,15,19-22,30] not responding, setting DOWN</font><br></div><div><br></div><div>It's still very sparse, but it looks like slurm is trying to ping nodes that have already been removed (they no longer exist, because the slurm_suspend.sh script removed them).</div><div><br></div><div>I tried sinfo -R, but it doesn't give much information.</div><div><br></div><div><font face="monospace">$ sinfo -R<br>REASON               USER      TIMESTAMP           NODELIST<br>Not responding       slurm     2021-07-30T22:42:05 slurm4-compute[9,15,19-22,30]</font><br></div><div><br></div><div>These machines are gone, so they should not respond. </div><div><br></div><div><font face="monospace">$ ping slurm4-compute9<br>ping: slurm4-compute9: Name or service not known</font><br></div><div><br></div><div>This is expected.</div><div><br></div><div>Why does slurm keep trying to contact nodes that have already been removed? slurm_suspend.sh does the following to "remove" a node from the partition:</div><div><font face="monospace">> scontrol update nodename=${host} nodeaddr="(null)"</font></div><div>Maybe this isn't the correct way to do it? Is there a way to force slurm to forget about the node? 
I tried "scontrol update node=$node state=idle", but this only works for a few minutes until slurm's ping agent kicks in and marks them down again.</div><div><br></div><div>Thanks!!</div><div>Soichi </div><div><br></div><div><br></div><div><br></div><div><br></div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jul 30, 2021 at 2:21 PM Soichi Hayashi <<a href="mailto:hayashis@iu.edu">hayashis@iu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hello. I need help troubleshooting our slurm cluster. <div><br></div><div>I am running slurm-wlm 17.11.2 on Ubuntu 20 on a public cloud infrastructure (Jetstream) using an elastic computing mechanism (<a href="https://slurm.schedmd.com/elastic_computing.html" target="_blank">https://slurm.schedmd.com/elastic_computing.html</a>). Our cluster works for the most part, but for some reason, a few of our nodes constantly go into the "down" state.<div><br></div><div><font face="monospace">PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE NODELIST<br>cloud*       up 2-00:00:00 1-infinite   no    YES:4        all     10       idle~ slurm9-compute[1-5,10,12-15]<br>cloud*       up 2-00:00:00 1-infinite   no    YES:4        all      5        down slurm9-compute[6-9,11]</font><br><div><br></div><div>The only entries I see in the slurm log are these:</div><div><br></div><div><font face="monospace">[2021-07-30T15:10:55.889] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING to=RESUME<br>[2021-07-30T15:21:37.339] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING* to=RESUME<br>[2021-07-30T15:27:30.039] update_node: node slurm9-compute6 reason set to: completing<br>[2021-07-30T15:27:30.040] update_node: node slurm9-compute6 state set to DOWN<br>[2021-07-30T15:27:40.830] update_node: node slurm9-compute6 state set to 
IDLE</font><br></div><div>..</div><div>[2021-07-30T15:34:20.628] error: Nodes slurm9-compute[6-9,11] not responding, setting DOWN<br></div></div><div><br></div><div>With elastic computing, any unused nodes are automatically removed (by SuspendProgram=/usr/local/sbin/slurm_suspend.sh). So nodes are *expected* to stop responding once they are removed, but they should not be marked as DOWN. They should simply be set to "idle". </div></div><div><br></div><div>To work around this issue, I am running the following cron job:</div><div><br></div><div><font face="monospace">0 0 * * * scontrol update node=slurm9-compute[1-30] state=resume</font><br></div><div><br></div><div>This "works" somewhat, but our nodes go into the "DOWN" state so often that running this every hour is not enough.</div><div><br></div><div>Here is the full content of our slurm.conf:</div><div><br></div><div><font face="monospace">root@slurm9:~# cat /etc/slurm-llnl/slurm.conf <br>ClusterName=slurm9<br>ControlMachine=slurm9<br><br>SlurmUser=slurm<br>SlurmdUser=root<br>SlurmctldPort=6817<br>SlurmdPort=6818<br>AuthType=auth/munge<br>StateSaveLocation=/tmp<br>SlurmdSpoolDir=/tmp/slurmd<br>SwitchType=switch/none<br>MpiDefault=none<br>SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid<br>SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid<br>ProctrackType=proctrack/pgid<br>ReturnToService=1<br>Prolog=/usr/local/sbin/slurm_prolog.sh<br><br>#<br># TIMERS<br>SlurmctldTimeout=300<br>SlurmdTimeout=300<br>#make slurm a little more tolerant here<br>MessageTimeout=30<br>TCPTimeout=15<br>BatchStartTimeout=20<br>GetEnvTimeout=20<br>InactiveLimit=0<br>MinJobAge=604800<br>KillWait=30<br>Waittime=0<br>#<br># SCHEDULING<br>SchedulerType=sched/backfill<br>SelectType=select/cons_res<br>SelectTypeParameters=CR_CPU_Memory<br>#FastSchedule=0<br><br># LOGGING<br>SlurmctldDebug=3<br>SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log<br>SlurmdDebug=3<br>SlurmdLogFile=/var/log/slurm-llnl/slurmd.log<br>JobCompType=jobcomp/none<br><br># 
ACCOUNTING<br>JobAcctGatherType=jobacct_gather/linux<br>JobAcctGatherFrequency=30<br><br>AccountingStorageType=accounting_storage/filetxt<br>AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log<br><br>#CLOUD CONFIGURATION<br>PrivateData=cloud<br>ResumeProgram=/usr/local/sbin/slurm_resume.sh<br>SuspendProgram=/usr/local/sbin/slurm_suspend.sh<br>ResumeRate=1 #number of nodes per minute that can be created; 0 means no limit<br>ResumeTimeout=900 #max time in seconds between ResumeProgram running and when the node is ready for use<br>SuspendRate=1 #number of nodes per minute that can be suspended/destroyed<br>SuspendTime=600 #time in seconds before an idle node is suspended<br>SuspendTimeout=300 #time between running SuspendProgram and the node being completely down<br>TreeWidth=30<br><br>NodeName=slurm9-compute[1-15] State=CLOUD CPUs=24 RealMemory=60388<br>PartitionName=cloud LLN=YES Nodes=slurm9-compute[1-15] Default=YES MaxTime=48:00:00 State=UP Shared=YES</font><br></div><div><font face="monospace"><br></font></div><div><font face="arial, sans-serif">I appreciate your assistance!</font></div><div><font face="arial, sans-serif"><br></font></div><div><font face="arial, sans-serif">Soichi Hayashi</font></div><div><font face="arial, sans-serif">Indiana University</font></div><div><font face="arial, sans-serif"><br></font></div><div><font face="monospace"><br></font></div></div>

</blockquote></div></div>