[slurm-users] Down nodes

Soichi Hayashi hayashis at iu.edu
Fri Jul 30 22:51:37 UTC 2021


Brian,

Thank you for your reply, and thanks for setting a proper subject line. I
forgot to edit it before I sent it!

I am not sure how to reply to your reply, but I hope this makes it to the
right place.

I've updated slurm.conf to increase the controller debug level
> SlurmctldDebug=5
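(As an aside, I believe the same level can also be set at runtime with
scontrol setdebug, so slurmctld doesn't need a restart; numeric level 5
should correspond to "debug".)

$ scontrol setdebug debug    # revert later with: scontrol setdebug info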

I now see additional log output (debug).

[2021-07-30T22:42:05.255] debug:  Spawning ping agent for
slurm4-compute[2-6,10,12-14]
[2021-07-30T22:42:05.256] error: Nodes slurm4-compute[9,15,19-22,30] not
responding, setting DOWN

It's still very sparse, but it looks like slurm is trying to ping nodes that
have already been removed (they no longer exist, since they were torn down by
the slurm_suspend.sh script).

I tried sinfo -R, but it doesn't really give much info:

$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       slurm     2021-07-30T22:42:05
slurm4-compute[9,15,19-22,30]

These machines are gone, so they should not respond.

$ ping slurm4-compute9
ping: slurm4-compute9: Name or service not known

This is expected.

Why does slurm keep trying to contact nodes that have already been removed?
slurm_suspend.sh does the following to "remove" a node from the partition:
> scontrol update nodename=${host} nodeaddr="(null)"
Maybe this isn't the correct way to do it? Is there a way to force slurm to
forget about the node? I tried "scontrol update node=$node state=idle", but
this only works for a few minutes, until slurm's ping agent kicks in and
marks them down again.
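For context, here is a stripped-down sketch of what our slurm_suspend.sh
does (the teardown command is specific to our Jetstream setup, so treat
that line as a placeholder):

# slurm_suspend.sh (simplified sketch) - slurmctld passes the hostlist as $1
hosts=$(scontrol show hostnames "$1")
for host in $hosts; do
    openstack server delete "$host"                     # placeholder for our real teardown step
    scontrol update nodename="$host" nodeaddr="(null)"  # the "remove" step quoted above
done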

Thanks!!
Soichi




On Fri, Jul 30, 2021 at 2:21 PM Soichi Hayashi <hayashis at iu.edu> wrote:

> Hello. I need help troubleshooting our slurm cluster.
>
> I am running slurm-wlm 17.11.2 on Ubuntu 20 on a public cloud
> infrastructure (Jetstream) using an elastic computing mechanism (
> https://slurm.schedmd.com/elastic_computing.html). Our cluster works for
> the most part, but for some reason, a few of our nodes constantly go into
> the "down" state.
>
> PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES
>   STATE NODELIST
> cloud*       up 2-00:00:00 1-infinite   no    YES:4        all     10
>   idle~ slurm9-compute[1-5,10,12-15]
> cloud*       up 2-00:00:00 1-infinite   no    YES:4        all      5
>    down slurm9-compute[6-9,11]
>
> The only thing I see in the slurm log is this:
>
> [2021-07-30T15:10:55.889] Invalid node state transition requested for node
> slurm9-compute6 from=COMPLETING to=RESUME
> [2021-07-30T15:21:37.339] Invalid node state transition requested for node
> slurm9-compute6 from=COMPLETING* to=RESUME
> [2021-07-30T15:27:30.039] update_node: node slurm9-compute6 reason set to:
> completing
> [2021-07-30T15:27:30.040] update_node: node slurm9-compute6 state set to
> DOWN
> [2021-07-30T15:27:40.830] update_node: node slurm9-compute6 state set to
> IDLE
> ..
> [2021-07-30T15:34:20.628] error: Nodes slurm9-compute[6-9,11] not
> responding, setting DOWN
>
> With elastic computing, any unused nodes are automatically removed
> (by SuspendProgram=/usr/local/sbin/slurm_suspend.sh). So nodes are
> *expected* not to respond once they are removed, but they should not be
> marked as DOWN; they should simply be set to "idle".
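> (For what it's worth, a quick way to check how the controller currently
> classifies one of these nodes, just as a diagnostic:)
>
> $ scontrol show node slurm9-compute6 | grep -iE 'state|reason'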
>
> To work around this issue, I am running the following cron job.
>
> 0 0 * * * scontrol update node=slurm9-compute[1-30] state=resume
>
> This "works" somewhat.. but our nodes go to "DOWN" state so often that
> running this every hour is not enough.
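> (A more targeted variant I have been considering, which only resumes nodes
> that slurm flagged as "Not responding", is something like this untested
> sketch:)
>
> $ sinfo -R -h -o '%E|%N' | awk -F'|' '/Not responding/{print $2}' | xargs -r -I{} scontrol update nodename={} state=resume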
>
> Here is the full content of our slurm.conf
>
> root@slurm9:~# cat /etc/slurm-llnl/slurm.conf
> ClusterName=slurm9
> ControlMachine=slurm9
>
> SlurmUser=slurm
> SlurmdUser=root
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> StateSaveLocation=/tmp
> SlurmdSpoolDir=/tmp/slurmd
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
> ProctrackType=proctrack/pgid
> ReturnToService=1
> Prolog=/usr/local/sbin/slurm_prolog.sh
>
> #
> # TIMERS
> SlurmctldTimeout=300
> SlurmdTimeout=300
> #make slurm a little more tolerant here
> MessageTimeout=30
> TCPTimeout=15
> BatchStartTimeout=20
> GetEnvTimeout=20
> InactiveLimit=0
> MinJobAge=604800
> KillWait=30
> Waittime=0
> #
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU_Memory
> #FastSchedule=0
>
> # LOGGING
> SlurmctldDebug=3
> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> SlurmdDebug=3
> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> JobCompType=jobcomp/none
>
> # ACCOUNTING
> JobAcctGatherType=jobacct_gather/linux
> JobAcctGatherFrequency=30
>
> AccountingStorageType=accounting_storage/filetxt
> AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log
>
> #CLOUD CONFIGURATION
> PrivateData=cloud
> ResumeProgram=/usr/local/sbin/slurm_resume.sh
> SuspendProgram=/usr/local/sbin/slurm_suspend.sh
> # nodes per minute that can be created; 0 means no limit
> ResumeRate=1
> # max seconds between ResumeProgram running and the node being ready for use
> ResumeTimeout=900
> # nodes per minute that can be suspended/destroyed
> SuspendRate=1
> # seconds before an idle node is suspended
> SuspendTime=600
> # seconds between running SuspendProgram and the node being completely down
> SuspendTimeout=300
> TreeWidth=30
>
> NodeName=slurm9-compute[1-15] State=CLOUD CPUs=24 RealMemory=60388
> PartitionName=cloud LLN=YES Nodes=slurm9-compute[1-15] Default=YES
> MaxTime=48:00:00 State=UP Shared=YES
>
> I appreciate your assistance!
>
> Soichi Hayashi
> Indiana University
>
>
>