[slurm-users] Down nodes

Brian Andrus toomuchit at gmail.com
Fri Jul 30 19:03:50 UTC 2021


Soichi,

(I added a subject)

You want to do 'sinfo -R' to find out the reason they are going down.
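For example, something like this (the format string is just one way to lay it out):

    # show the reason, when it was set, and which nodes it applies to
    sinfo -R -o "%50E %19H %N"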

You may also want to bump up your logging to get more info. It looks 
like they are getting stuck in a 'completing' state. You will want to 
look at your slurmd logs on the nodes themselves for further info as to why.
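One way to do that, assuming the log paths from your slurm.conf below:

    # in slurm.conf, raise the daemon log levels, e.g.
    #   SlurmctldDebug=debug
    #   SlurmdDebug=debug
    # (numeric levels like 5 or 6 also work on older releases)
    # then push the change out to the running daemons:
    scontrol reconfigure

    # on an affected node, watch what slurmd reports while a job completes:
    tail -f /var/log/slurm-llnl/slurmd.log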

Brian Andrus

On 7/30/2021 11:21 AM, Soichi Hayashi wrote:
> Hello. I need help troubleshooting our Slurm cluster.
>
> I am running slurm-wlm 17.11.2 on Ubuntu 20 on a public cloud 
> infrastructure (Jetstream) using an elastic computing 
> mechanism (https://slurm.schedmd.com/elastic_computing.html). Our cluster works 
> for the most part, but for some reason a few of our nodes constantly 
> go into the "down" state.
>
> PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
> cloud*       up 2-00:00:00 1-infinite   no    YES:4    all    10 idle~ slurm9-compute[1-5,10,12-15]
> cloud*       up 2-00:00:00 1-infinite   no    YES:4    all     5  down slurm9-compute[6-9,11]
>
> The only messages I see in the slurm log are these:
>
> [2021-07-30T15:10:55.889] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING to=RESUME
> [2021-07-30T15:21:37.339] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING* to=RESUME
> [2021-07-30T15:27:30.039] update_node: node slurm9-compute6 reason set to: completing
> [2021-07-30T15:27:30.040] update_node: node slurm9-compute6 state set to DOWN
> [2021-07-30T15:27:40.830] update_node: node slurm9-compute6 state set to IDLE
> ...
> [2021-07-30T15:34:20.628] error: Nodes slurm9-compute[6-9,11] not responding, setting DOWN
>
> With elastic computing, unused nodes are automatically removed 
> (by SuspendProgram=/usr/local/sbin/slurm_suspend.sh), so nodes are 
> *expected* to stop responding once they are removed -- but they should 
> not be marked DOWN. They should simply be set back to "idle".
>
> To work around this issue, I am running the following cron job.
>
> 0 0 * * * scontrol update node=slurm9-compute[1-30] state=resume
>
> This "works" somewhat.. but our nodes go to "DOWN" state so often that 
> running this every hour is not enough.
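>
> (A variant that runs more often and only touches the nodes currently 
> marked down might look like this -- untested sketch; the xargs -r skips 
> the run when nothing is down, and the % is escaped as \% because cron 
> treats a bare % as a newline:
>
> */10 * * * * sinfo -h -t down -o "\%N" | xargs -r -I{} scontrol update nodename={} state=resume
> )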
>
> Here is the full content of our slurm.conf:
>
> root@slurm9:~# cat /etc/slurm-llnl/slurm.conf
> ClusterName=slurm9
> ControlMachine=slurm9
>
> SlurmUser=slurm
> SlurmdUser=root
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> StateSaveLocation=/tmp
> SlurmdSpoolDir=/tmp/slurmd
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
> ProctrackType=proctrack/pgid
> ReturnToService=1
> Prolog=/usr/local/sbin/slurm_prolog.sh
>
> #
> # TIMERS
> SlurmctldTimeout=300
> SlurmdTimeout=300
> #make slurm a little more tolerant here
> MessageTimeout=30
> TCPTimeout=15
> BatchStartTimeout=20
> GetEnvTimeout=20
> InactiveLimit=0
> MinJobAge=604800
> KillWait=30
> Waittime=0
> #
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU_Memory
> #FastSchedule=0
>
> # LOGGING
> SlurmctldDebug=3
> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> SlurmdDebug=3
> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> JobCompType=jobcomp/none
>
> # ACCOUNTING
> JobAcctGatherType=jobacct_gather/linux
> JobAcctGatherFrequency=30
>
> AccountingStorageType=accounting_storage/filetxt
> AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log
>
> #CLOUD CONFIGURATION
> PrivateData=cloud
> ResumeProgram=/usr/local/sbin/slurm_resume.sh
> SuspendProgram=/usr/local/sbin/slurm_suspend.sh
> ResumeRate=1        #number of nodes per minute that can be created; 0 means no limit
> ResumeTimeout=900   #max time in seconds between ResumeProgram running and when the node is ready for use
> SuspendRate=1       #number of nodes per minute that can be suspended/destroyed
> SuspendTime=600     #time in seconds before an idle node is suspended
> SuspendTimeout=300  #time between running SuspendProgram and the node being completely down
> TreeWidth=30
>
> NodeName=slurm9-compute[1-15] State=CLOUD CPUs=24 RealMemory=60388
> PartitionName=cloud LLN=YES Nodes=slurm9-compute[1-15] Default=YES MaxTime=48:00:00 State=UP Shared=YES
>
> I appreciate your assistance!
>
> Soichi Hayashi
> Indiana University
>
>