[slurm-users] (no subject)

Soichi Hayashi hayashis at iu.edu
Fri Jul 30 18:21:19 UTC 2021


Hello. I need help troubleshooting our Slurm cluster.

I am running slurm-wlm 17.11.2 on Ubuntu 20 on a public cloud
infrastructure (Jetstream) using the elastic computing mechanism (
https://slurm.schedmd.com/elastic_computing.html). Our cluster works for
the most part, but for some reason a few of our nodes constantly go into
the "down" state.

PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES STATE NODELIST
cloud*       up 2-00:00:00 1-infinite   no    YES:4        all     10 idle~ slurm9-compute[1-5,10,12-15]
cloud*       up 2-00:00:00 1-infinite   no    YES:4        all      5  down slurm9-compute[6-9,11]

The only relevant entries I see in the slurmctld log are these:

[2021-07-30T15:10:55.889] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING to=RESUME
[2021-07-30T15:21:37.339] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING* to=RESUME
[2021-07-30T15:27:30.039] update_node: node slurm9-compute6 reason set to: completing
[2021-07-30T15:27:30.040] update_node: node slurm9-compute6 state set to DOWN
[2021-07-30T15:27:40.830] update_node: node slurm9-compute6 state set to IDLE
..
[2021-07-30T15:34:20.628] error: Nodes slurm9-compute[6-9,11] not responding, setting DOWN

With elastic computing, any unused nodes are automatically removed
(by SuspendProgram=/usr/local/sbin/slurm_suspend.sh). So nodes are
*expected* to stop responding once they are removed, but they should not
be marked as DOWN; they should simply be set back to "idle".
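
For reference, my understanding is that Slurm invokes the SuspendProgram with the hostlist of nodes to power down as its only argument. A minimal sketch of such a script (this is not our actual slurm_suspend.sh, and the log path is just an example) would be:

#!/bin/bash
# Minimal SuspendProgram sketch: Slurm passes the nodes to remove as $1,
# e.g. "slurm9-compute[6-9]". Expand the hostlist and act on each node.
for host in $(scontrol show hostnames "$1"); do
    # The real script would call the cloud API here to destroy the instance.
    echo "$(date) suspending $host" >> /var/log/slurm-llnl/power_save.log
done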

To work around this issue, I am running the following cron job.

0 0 * * * scontrol update node=slurm9-compute[1-30] state=resume

This "works" somewhat.. but our nodes go to "DOWN" state so often that
running this every hour is not enough.
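
A slightly more targeted variant of that workaround (just a sketch, reusing the partition and node names from above) would be to resume only the nodes currently reported as down, but it still treats the symptom rather than the cause:

#!/bin/bash
# Resume only the cloud nodes currently in the DOWN state instead of
# blindly resuming the whole slurm9-compute[1-30] range.
down_nodes=$(sinfo -h -p cloud -t down -o "%N" | paste -sd, -)
if [ -n "$down_nodes" ]; then
    scontrol update NodeName="$down_nodes" State=RESUME
fi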

Here is the full content of our slurm.conf:

root@slurm9:~# cat /etc/slurm-llnl/slurm.conf
ClusterName=slurm9
ControlMachine=slurm9

SlurmUser=slurm
SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
ProctrackType=proctrack/pgid
ReturnToService=1
Prolog=/usr/local/sbin/slurm_prolog.sh

#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
#make slurm a little more tolerant here
MessageTimeout=30
TCPTimeout=15
BatchStartTimeout=20
GetEnvTimeout=20
InactiveLimit=0
MinJobAge=604800
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#FastSchedule=0

# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
JobCompType=jobcomp/none

# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

AccountingStorageType=accounting_storage/filetxt
AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log

#CLOUD CONFIGURATION
PrivateData=cloud
ResumeProgram=/usr/local/sbin/slurm_resume.sh
SuspendProgram=/usr/local/sbin/slurm_suspend.sh
ResumeRate=1 #number of nodes per minute that can be created; 0 means no limit
ResumeTimeout=900 #max time in seconds between ResumeProgram running and when the node is ready for use
SuspendRate=1 #number of nodes per minute that can be suspended/destroyed
SuspendTime=600 #time in seconds before an idle node is suspended
SuspendTimeout=300 #time between running SuspendProgram and the node being completely down
TreeWidth=30

NodeName=slurm9-compute[1-15] State=CLOUD CPUs=24 RealMemory=60388
PartitionName=cloud LLN=YES Nodes=slurm9-compute[1-15] Default=YES MaxTime=48:00:00 State=UP Shared=YES
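
In case it is useful, the power-save and timeout values the controller is actually using can be double-checked with a standard scontrol query (nothing custom here):

# Show the power-save and timeout settings slurmctld is running with.
scontrol show config | grep -iE 'Suspend|Resume|ReturnToService|Timeout'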

I appreciate your assistance!

Soichi Hayashi
Indiana University