Hi all,
I have a cloud cluster running in GCP that seems to have gotten stuck in a state where slurmctld will not start/stop compute nodes. It just sits there with thousands of jobs in the queue and only a few compute nodes (out of thousands) up and running.
I've tried kicking it by setting node states with scontrol update (drain, resume, power_up, power_down, whatever), but nothing seems to happen.
In slurmctld.log I get entries like this:

2024-08-22 16:44:31,761 INFO: created 1 instances: nodes=hpcmarjola-computenodeset-384

but there is no corresponding entry in resume.log.
One hypothesis I have is that it's taking 30+ minutes to spin up a new node, but I know GCP doesn't take that long to create a VM, so how can I troubleshoot further what the slurm resume script is doing? E.g. I saw in the log:

[2024-08-22T16:39:57.151] powering up node hpcmarjola-computenodeset-7
[2024-08-22T16:59:10.371] Node hpcmarjola-computenodeset-7 now responding

That's over 19 minutes from "powering up" to "now responding".
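To put numbers on that delay across all nodes, I threw together a quick sketch (assuming the timestamp format and phrasing match my log excerpts above) that pairs each "powering up" line with its "now responding" line and reports the elapsed seconds per node:

```python
import re
from datetime import datetime

# Pair "powering up node X" with "Node X now responding" entries from
# slurmctld.log and report how long each node's power-up actually took.
TS = r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})\]"
POWER_UP = re.compile(TS + r" powering up node (\S+)")
RESPONDING = re.compile(TS + r" Node (\S+) now responding")

def power_up_times(lines):
    started = {}  # node name -> datetime of its "powering up" entry
    elapsed = {}  # node name -> seconds until "now responding"
    for line in lines:
        if m := POWER_UP.search(line):
            started[m.group(2)] = datetime.fromisoformat(m.group(1))
        elif m := RESPONDING.search(line):
            ts, node = datetime.fromisoformat(m.group(1)), m.group(2)
            if node in started:
                elapsed[node] = (ts - started.pop(node)).total_seconds()
    return elapsed

# The two lines from my actual log:
log = [
    "[2024-08-22T16:39:57.151] powering up node hpcmarjola-computenodeset-7",
    "[2024-08-22T16:59:10.371] Node hpcmarjola-computenodeset-7 now responding",
]
print(power_up_times(log))  # -> {'hpcmarjola-computenodeset-7': 1153.22}
```

So node 7 took 1153 s, i.e. well past the ResumeTimeout=600 I have configured.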
I tried setting SlurmctldDebug=verbose (instead of info) so that power/resume status messages get printed to the log.
And I see things like this:

[2024-08-23T16:40:29.006] node hpcmarjola-computenodeset-1068 not resumed by ResumeTimeout(600) - marking down and power_save
[2024-08-23T16:40:29.006] requeue job JobId=184535_503(185038) due to failure of node hpcmarjola-computenodeset-1068
[2024-08-23T16:40:29.006] Requeuing JobId=184535_503(185038)
[2024-08-23T16:40:29.007] POWER: power_save: waking nodes hpcmarjola-computenodeset-[2154-2156]
[2024-08-23T16:40:29.007] POWER: power_save: handle failed nodes hpcmarjola-computenodeset-[1065-1068]
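Since these failures seem widespread, here's a similar sketch to tally them from slurmctld.log (same caveat: it assumes the exact phrasing shown in my excerpts):

```python
import re
from collections import Counter

# Tally ResumeTimeout failures and requeued jobs from slurmctld.log lines,
# to see how widespread the problem actually is.
TIMEOUT = re.compile(r"node (\S+) not resumed by ResumeTimeout")
REQUEUE = re.compile(r"Requeuing JobId=(\S+)")

def summarize(lines):
    failed_nodes = Counter()   # node name -> number of timeout failures
    requeued_jobs = set()      # distinct requeued job ids
    for line in lines:
        if m := TIMEOUT.search(line):
            failed_nodes[m.group(1)] += 1
        if m := REQUEUE.search(line):
            requeued_jobs.add(m.group(1))
    return failed_nodes, requeued_jobs

# Two lines from my actual log as a smoke test:
log = [
    "[2024-08-23T16:40:29.006] node hpcmarjola-computenodeset-1068 not resumed by ResumeTimeout(600) - marking down and power_save",
    "[2024-08-23T16:40:29.006] Requeuing JobId=184535_503(185038)",
]
failed, requeued = summarize(log)
print(len(failed), "nodes timed out,", len(requeued), "jobs requeued")
```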
I'll try bumping this value, but ResumeTimeout=2400 seems crazy high.
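For reference, this is roughly what I'm planning to change in cloud.conf (the values are guesses on my part, not recommendations). One thing I notice while doing this: ResumeRate=2 means slurmctld only hands 2 nodes per minute to the resume program, which by itself would make a thousands-node ramp-up take many hours.

```
# cloud.conf fragment -- values I'm planning to try, not tested yet
ResumeTimeout=2400   # was 600; actual power-up is taking ~19-20 minutes
ResumeRate=10        # was 2; 2 nodes/minute seems far too slow at this scale
```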
Slurm version: 23.11.7
I read through this page: https://slurm.schedmd.com/power_save.html
My config files (generated by the GCP HPC Toolkit scripts):
topology.conf:

SwitchName=nodeset-root Switches=computenodeset
SwitchName=computenodeset Nodes=hpcmarjola-computenodeset-[0-4999]
cloud.conf:

PrivateData=cloud
LaunchParameters=enable_nss_slurm,use_interactive_step
SlurmctldParameters=cloud_dns,enable_configless,idle_on_node_suspend
SchedulerParameters=bf_continue,salloc_wait_nodes,ignore_prefer_validation
SuspendProgram=/slurm/scripts/suspend.py
ResumeProgram=/slurm/scripts/resume.py
ResumeFailProgram=/slurm/scripts/suspend.py
ResumeRate=2
ResumeTimeout=600
SuspendRate=2
SuspendTimeout=600
TreeWidth=16
TopologyPlugin=topology/tree
NodeSet=x-login Feature=x-login
PartitionName=x-login Nodes=x-login State=UP DefMemPerCPU=1 Hidden=YES RootOnly=YES
NodeName=DEFAULT State=UNKNOWN RealMemory=126832 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=1 CPUs=8
NodeName=hpcmarjola-computenodeset-[0-4999] State=CLOUD
NodeSet=computenodeset Nodes=hpcmarjola-computenodeset-[0-4999]
PartitionName=compute Nodes=computenodeset State=UP DefMemPerCPU=15854 SuspendTime=600 Oversubscribe=Exclusive PowerDownOnIdle=NO Default=YES ResumeTimeout=600 SuspendTimeout=600
SuspendExcParts=x-login
slurm.conf:
# slurm.conf
# https://slurm.schedmd.com/slurm.conf.html
# https://slurm.schedmd.com/configurator.html

ProctrackType=proctrack/cgroup
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
TaskPlugin=task/affinity,task/cgroup
MaxNodeCount=64000

#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

#
#
# LOGGING AND ACCOUNTING
AccountingStoreFlags=job_comment
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmdDebug=info
DebugFlags=Power

#
#
# TIMERS
MessageTimeout=60

################################################################################
# vvvvv WARNING: DO NOT MODIFY SECTION BELOW vvvvv #
################################################################################
SlurmctldHost=hpcmarjola-controller(internal-cloud-dns-redacted)
AuthType=auth/munge
AuthInfo=cred_expire=120
AuthAltTypes=auth/jwt
CredType=cred/munge
MpiDefault=none
ReturnToService=2
SlurmctldPort=6820-6830
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm

#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=hpcmarjola-controller
ClusterName=hpcmarjola
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd-%n.log

#
#
# GENERATED CLOUD CONFIGURATIONS
include cloud.conf

################################################################################
# ^^^^^ WARNING: DO NOT MODIFY SECTION ABOVE ^^^^^ #
################################################################################

MaxArraySize=200001
MaxJobCount=200001