slurmctld HA: backup controller does not schedule or start any jobs
Hi all,
I am trying out a slurmctld HA configuration on two servers, using Slurm 22.05.9 on AlmaLinux 9.4.
The problem: after stopping the primary slurmctld and slurmdbd, when I submit a job with sbatch while the backup slurmctld and slurmdbd are running, the job stays pending with Reason=None and is never scheduled or started.
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
43 cpu_multi job hpc PD 0:00 1 (None)
Why won't the job start, and what should I change to make it run?
The backup's slurmctld.log and my configuration files are shown below.
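In case it helps to reproduce, my failover test looks roughly like this (host names as in my slurm.conf below; the script name is just my test job):

```
# On the primary (gateway1): stop the controller and the DBD
systemctl stop slurmctld slurmdbd

# From a compute node: confirm the backup has taken over
scontrol ping            # gateway1 reported DOWN, gateway2 UP

# Submit the test job, then inspect the full job record
sbatch -p cpu_multi ./twocore.sh
scontrol show job 43     # Reason stays "None", job never leaves PD
```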
Backup's slurmctld.log:
[2025-04-09T15:31:17.000] debug3: Heartbeat at 1744180276
[2025-04-09T15:31:18.000] debug3: Heartbeat at 1744180277
[2025-04-09T15:31:19.022] debug3: Heartbeat at 1744180279
[2025-04-09T15:31:19.605] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from UID=1000
[2025-04-09T15:31:19.605] debug3: _set_hostname: Using auth hostname for alloc_node: compute1
[2025-04-09T15:31:19.605] debug3: JobDesc: user_id=1000 JobId=N/A partition=cpu_multi name=job
[2025-04-09T15:31:19.605] debug3: cpus=2-4294967294 pn_min_cpus=1 core_spec=-1
[2025-04-09T15:31:19.605] debug3: Nodes=1-[1] Sock/Node=65534 Core/Sock=65534 Thread/Core=65534
[2025-04-09T15:31:19.605] debug3: pn_min_memory_job=18446744073709551615 pn_min_tmp_disk=-1
[2025-04-09T15:31:19.605] debug3: immediate=0 reservation=(null)
[2025-04-09T15:31:19.605] debug3: features=(null) batch_features=(null) cluster_features=(null) prefer=(null)
[2025-04-09T15:31:19.605] debug3: req_nodes=(null) exc_nodes=(null)
[2025-04-09T15:31:19.605] debug3: time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2025-04-09T15:31:19.605] debug3: kill_on_node_fail=-1 script=#! /bin/bash
#SBATCH -p cpu_multi
#SBATC...
[2025-04-09T15:31:19.605] debug3: argv="./twocore.sh"
[2025-04-09T15:31:19.605] debug3: environment=SHELL=/bin/bash,PYENV_SHELL=bash,HISTCONTROL=ignoredups,...
[2025-04-09T15:31:19.605] debug3: stdin=/dev/null stdout=/misc/home/hpc/slurmtest/twocore_%J.out stderr=(null)
[2025-04-09T15:31:19.605] debug3: work_dir=/misc/home/hpc/slurmtest alloc_node:sid=compute1:281600
[2025-04-09T15:31:19.605] debug3: power_flags=
[2025-04-09T15:31:19.605] debug3: resp_host=(null) alloc_resp_port=0 other_port=0
[2025-04-09T15:31:19.605] debug3: dependency=(null) account=(null) qos=(null) comment=(null)
[2025-04-09T15:31:19.605] debug3: mail_type=0 mail_user=(null) nice=0 num_tasks=2 open_mode=0 overcommit=-1 acctg_freq=(null)
[2025-04-09T15:31:19.605] debug3: network=(null) begin=Unknown cpus_per_task=1 requeue=-1 licenses=(null)
[2025-04-09T15:31:19.605] debug3: end_time= signal=0@0 wait_all_nodes=-1 cpu_freq=
[2025-04-09T15:31:19.605] debug3: ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1 ntasks_per_tres=-1
[2025-04-09T15:31:19.605] debug3: mem_bind=0:(null) plane_size:65534
[2025-04-09T15:31:19.605] debug3: array_inx=(null)
[2025-04-09T15:31:19.605] debug3: burst_buffer=(null)
[2025-04-09T15:31:19.605] debug3: mcs_label=(null)
[2025-04-09T15:31:19.605] debug3: deadline=Unknown
[2025-04-09T15:31:19.605] debug3: bitflags=0x1a00c000 delay_boot=4294967294
[2025-04-09T15:31:19.605] debug3: assoc_mgr_fill_in_user: found correct user: hpc(1000)
[2025-04-09T15:31:19.605] debug5: assoc_mgr_fill_in_assoc: looking for assoc of user=hpc(1000), acct=hpc, cluster=cluster, partition=cpu_multi
[2025-04-09T15:31:19.605] debug3: assoc_mgr_fill_in_assoc: found correct association of user=hpc(1000), acct=hpc, cluster=cluster, partition=cpu_multi to assoc=16 acct=hpc
[2025-04-09T15:31:19.605] debug3: found correct qos
[2025-04-09T15:31:19.607] debug2: priority/multifactor: priority_p_set: initial priority for job 44 is 33
[2025-04-09T15:31:19.607] debug2: found 1 usable nodes from config containing compute1
[2025-04-09T15:31:19.607] debug2: found 1 usable nodes from config containing compute2
[2025-04-09T15:31:19.607] debug3: _pick_best_nodes: JobId=44 idle_nodes 2 share_nodes 2
[2025-04-09T15:31:19.607] debug2: select/cons_tres: select_p_job_test: evaluating JobId=44
[2025-04-09T15:31:19.607] debug2: sched: JobId=44 allocated resources: NodeList=(null)
[2025-04-09T15:31:19.607] _slurm_rpc_submit_batch_job: JobId=44 InitPrio=33 usec=2490
[2025-04-09T15:31:19.608] debug3: create_mmap_buf: loaded file `/var/spool/slurm/ctld/job_state` as buf_t
[2025-04-09T15:31:19.609] debug3: Writing job id 45 to header record of job_state file
[2025-04-09T15:31:21.000] debug3: Heartbeat at 1744180280
[2025-04-09T15:31:21.257] debug2: _slurm_connect: failed to connect to 192.168.56.11:6817: Connection refused
[2025-04-09T15:31:21.257] debug2: Error connecting slurm stream socket at 192.168.56.11:6817: Connection refused
[2025-04-09T15:31:22.000] debug3: Heartbeat at 1744180282
[2025-04-09T15:31:24.000] debug3: Heartbeat at 1744180283
[2025-04-09T15:31:25.004] debug3: Heartbeat at 1744180285
[2025-04-09T15:31:27.001] debug3: Heartbeat at 1744180287
[2025-04-09T15:31:29.000] debug3: Heartbeat at 1744180288
[2025-04-09T15:31:30.000] debug3: Heartbeat at 1744180289
[2025-04-09T15:31:31.006] debug3: Heartbeat at 1744180291
[2025-04-09T15:31:32.822] debug2: _slurm_connect: failed to connect to 192.168.56.11:6817: Connection refused
[2025-04-09T15:31:32.822] debug2: Error connecting slurm stream socket at 192.168.56.11:6817: Connection refused
[2025-04-09T15:31:33.007] debug3: Heartbeat at 1744180293
[2025-04-09T15:31:35.000] debug3: Heartbeat at 1744180294
[2025-04-09T15:31:36.002] debug3: Heartbeat at 1744180296
[2025-04-09T15:31:36.395] debug2: select/cons_tres: select_p_job_test: evaluating JobId=43
[2025-04-09T15:31:36.395] debug2: select/cons_tres: select_p_job_test: evaluating JobId=44
[2025-04-09T15:31:38.000] debug3: Heartbeat at 1744180297
[2025-04-09T15:31:38.497] debug2: Performing purge of old job records
[2025-04-09T15:31:39.000] debug3: Heartbeat at 1744180298
[2025-04-09T15:31:40.000] debug3: Heartbeat at 1744180300
[2025-04-09T15:31:40.655] debug2: Testing job time limits and checkpoints
[2025-04-09T15:31:42.000] debug3: Heartbeat at 1744180301
[2025-04-09T15:31:43.000] debug3: Heartbeat at 1744180302
slurm.conf:
ClusterName=cluster
SlurmctldHost=gateway1 #Primary(192.168.56.11)
SlurmctldHost=gateway2 #Backup(192.168.56.12)
MpiDefault=pmix
ProctrackType=proctrack/cgroup
PrologFlags=Contain
ReturnToService=0
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskEpilog=/etc/slurm/taskepilog.sh
TaskPlugin=task/cgroup,task/affinity
TaskProlog=/etc/slurm/taskprolog.sh
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=10
SlurmdTimeout=300
Waittime=0
DefMemPerCPU=32
SchedulerType=sched/builtin
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
PriorityWeightPartition=1000
AccountingStorageHost=gateway1
AccountingStorageBackupHost=gateway2
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreFlags=job_comment
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=debug5
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=compute1 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=3900 Weight=1
NodeName=compute2 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=3900 Weight=1
PartitionName=gpu_single Nodes=ALL PriorityJobFactor=30 MaxTime=INFINITE State=UP Default=YES
PartitionName=cpu_single Nodes=ALL PriorityJobFactor=10 MaxTime=INFINITE State=UP
PartitionName=cpu_multi Nodes=ALL MaxTime=INFINITE State=UP
slurmdbd.conf:
AuthType=auth/munge
DebugLevel=4
DbdHost=gateway1
DbdBackupHost=gateway2
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurm/slurmdbd.pid
PurgeEventAfter=1month
PurgeJobAfter=1month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
PurgeTXNAfter=1month
PurgeUsageAfter=1month
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=gateway1
StoragePass=mypassword
StorageUser=slurm
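While the primary is down, I can still query accounting through the backup slurmdbd, e.g.:

```
# Accounting still answers via the backup DBD (gateway2)
sacctmgr -n show cluster
sacctmgr show association user=hpc format=Account,Partition,QOS
```

so the backup slurmdbd itself appears to be reachable; only scheduling seems affected.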
Best regards,
Hiro