Hi all,
I have a cloud cluster running in GCP that seems to have gotten stuck in a state where slurmctld will not start/stop compute nodes. It just sits there with thousands of jobs in the queue and only a few compute nodes (out of thousands) up and running.
I've tried kicking it by setting node states with scontrol update (drain, resume, power_up, power_down, whatever), but nothing seems to happen.
In slurmctld.log I get entries like this:

2024-08-22 16:44:31,761 INFO: created 1 instances: nodes=hpcmarjola-computenodeset-384

but there is no corresponding entry in resume.log.
One hypothesis I have is that it's taking 30+ minutes to spin up a new node, but I know GCP doesn't take that long to create a VM, so how can I troubleshoot further what the slurm resume script is doing? E.g. I saw in the log:

[2024-08-22T16:39:57.151] powering up node hpcmarjola-computenodeset-7
[2024-08-22T16:59:10.371] Node hpcmarjola-computenodeset-7 now responding

That's over 19 minutes from "powering up" to "now responding".
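To put numbers on that delay across all nodes, I threw together a quick sketch (assuming the timestamp format and phrasing match my log excerpts above) that pairs each "powering up" line with its "now responding" line and reports the elapsed seconds per node:

```python
import re
from datetime import datetime

# Pair "powering up node X" with "Node X now responding" entries from
# slurmctld.log and report how long each node's power-up actually took.
TS = r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})\]"
POWER_UP = re.compile(TS + r" powering up node (\S+)")
RESPONDING = re.compile(TS + r" Node (\S+) now responding")

def power_up_times(lines):
    started = {}  # node name -> datetime of its "powering up" entry
    elapsed = {}  # node name -> seconds until "now responding"
    for line in lines:
        if m := POWER_UP.search(line):
            started[m.group(2)] = datetime.fromisoformat(m.group(1))
        elif m := RESPONDING.search(line):
            ts, node = datetime.fromisoformat(m.group(1)), m.group(2)
            if node in started:
                elapsed[node] = (ts - started.pop(node)).total_seconds()
    return elapsed

# The two lines from my actual log:
log = [
    "[2024-08-22T16:39:57.151] powering up node hpcmarjola-computenodeset-7",
    "[2024-08-22T16:59:10.371] Node hpcmarjola-computenodeset-7 now responding",
]
print(power_up_times(log))  # -> {'hpcmarjola-computenodeset-7': 1153.22}
```

So node 7 took 1153 s, i.e. well past the ResumeTimeout=600 I have configured.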
I tried setting SlurmctldDebug=verbose (instead of info) so that power/resume status messages get printed to the log.
And I see things like this:

[2024-08-23T16:40:29.006] node hpcmarjola-computenodeset-1068 not resumed by ResumeTimeout(600) - marking down and power_save
[2024-08-23T16:40:29.006] requeue job JobId=184535_503(185038) due to failure of node hpcmarjola-computenodeset-1068
[2024-08-23T16:40:29.006] Requeuing JobId=184535_503(185038)
[2024-08-23T16:40:29.007] POWER: power_save: waking nodes hpcmarjola-computenodeset-[2154-2156]
[2024-08-23T16:40:29.007] POWER: power_save: handle failed nodes hpcmarjola-computenodeset-[1065-1068]
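Since these failures seem widespread, here's a similar sketch to tally them from slurmctld.log (same caveat: it assumes the exact phrasing shown in my excerpts):

```python
import re
from collections import Counter

# Tally ResumeTimeout failures and requeued jobs from slurmctld.log lines,
# to see how widespread the problem actually is.
TIMEOUT = re.compile(r"node (\S+) not resumed by ResumeTimeout")
REQUEUE = re.compile(r"Requeuing JobId=(\S+)")

def summarize(lines):
    failed_nodes = Counter()   # node name -> number of timeout failures
    requeued_jobs = set()      # distinct requeued job ids
    for line in lines:
        if m := TIMEOUT.search(line):
            failed_nodes[m.group(1)] += 1
        if m := REQUEUE.search(line):
            requeued_jobs.add(m.group(1))
    return failed_nodes, requeued_jobs

# Two lines from my actual log as a smoke test:
log = [
    "[2024-08-23T16:40:29.006] node hpcmarjola-computenodeset-1068 not resumed by ResumeTimeout(600) - marking down and power_save",
    "[2024-08-23T16:40:29.006] Requeuing JobId=184535_503(185038)",
]
failed, requeued = summarize(log)
print(len(failed), "nodes timed out,", len(requeued), "jobs requeued")
```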
I'll try bumping this value, but ResumeTimeout=2400 seems crazy high.
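For reference, this is roughly what I'm planning to change in cloud.conf (the values are guesses on my part, not recommendations). One thing I notice while doing this: ResumeRate=2 means slurmctld only hands 2 nodes per minute to the resume program, which by itself would make a thousands-node ramp-up take many hours.

```
# cloud.conf fragment -- values I'm planning to try, not tested yet
ResumeTimeout=2400   # was 600; actual power-up is taking ~19-20 minutes
ResumeRate=10        # was 2; 2 nodes/minute seems far too slow at this scale
```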
Slurm version: 23.11.7
I read through this page: https://slurm.schedmd.com/power_save.html
My config files (generated by the GCP HPC Toolkit scripts):
topology.conf:

SwitchName=nodeset-root Switches=computenodeset
SwitchName=computenodeset Nodes=hpcmarjola-computenodeset-[0-4999]
cloud.conf:

PrivateData=cloud
LaunchParameters=enable_nss_slurm,use_interactive_step
SlurmctldParameters=cloud_dns,enable_configless,idle_on_node_suspend
SchedulerParameters=bf_continue,salloc_wait_nodes,ignore_prefer_validation
SuspendProgram=/slurm/scripts/suspend.py
ResumeProgram=/slurm/scripts/resume.py
ResumeFailProgram=/slurm/scripts/suspend.py
ResumeRate=2
ResumeTimeout=600
SuspendRate=2
SuspendTimeout=600
TreeWidth=16
TopologyPlugin=topology/tree
NodeSet=x-login Feature=x-login
PartitionName=x-login Nodes=x-login State=UP DefMemPerCPU=1 Hidden=YES RootOnly=YES
NodeName=DEFAULT State=UNKNOWN RealMemory=126832 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=1 CPUs=8
NodeName=hpcmarjola-computenodeset-[0-4999] State=CLOUD
NodeSet=computenodeset Nodes=hpcmarjola-computenodeset-[0-4999]
PartitionName=compute Nodes=computenodeset State=UP DefMemPerCPU=15854 SuspendTime=600 Oversubscribe=Exclusive PowerDownOnIdle=NO Default=YES ResumeTimeout=600 SuspendTimeout=600
SuspendExcParts=x-login
slurm.conf:
# slurm.conf
# https://slurm.schedmd.com/slurm.conf.html
# https://slurm.schedmd.com/configurator.html

ProctrackType=proctrack/cgroup
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
TaskPlugin=task/affinity,task/cgroup
MaxNodeCount=64000

#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

#
#
# LOGGING AND ACCOUNTING
AccountingStoreFlags=job_comment
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmdDebug=info
DebugFlags=Power

#
#
# TIMERS
MessageTimeout=60

################################################################################
# vvvvv WARNING: DO NOT MODIFY SECTION BELOW vvvvv #
################################################################################
SlurmctldHost=hpcmarjola-controller(internal-cloud-dns-redacted)
AuthType=auth/munge
AuthInfo=cred_expire=120
AuthAltTypes=auth/jwt
CredType=cred/munge
MpiDefault=none
ReturnToService=2
SlurmctldPort=6820-6830
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm

#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=hpcmarjola-controller
ClusterName=hpcmarjola
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd-%n.log

#
#
# GENERATED CLOUD CONFIGURATIONS
include cloud.conf

################################################################################
# ^^^^^ WARNING: DO NOT MODIFY SECTION ABOVE ^^^^^ #
################################################################################

MaxArraySize=200001
MaxJobCount=200001