I am doing a new install of Slurm 24.05.3. I have all the packages built and installed on the head node and compute nodes, with the same munge.key, slurm.conf, and gres.conf files on each. I was able to test munge successfully with the munge and unmunge commands, and time is synced with chronyd. I can't seem to find any useful errors in the logs. For some reason, when I run sinfo, no nodes are listed; I just see the headers for each column. Has anyone seen this, or does anyone know what a next troubleshooting step would be? I'm new to this and not sure where to go from here. Thanks for any and all help!


The odd output I am seeing:

[username@headnode ~]$ sinfo

PARTITION AVAIL    TIMELIMIT NODES   STATE   NODELIST


(Nothing is output showing the status of partitions or nodes.)
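If it helps, here are the checks I know of that I can run next; happy to post the output of any of them (these are standard scontrol/systemctl/sinfo commands, so correct me if there's a better place to look):

```shell
# Ask the controller directly for partition and node state
scontrol show partition
scontrol show node k001

# Confirm the controller is reachable and the daemons are running
scontrol ping
systemctl status slurmctld   # on the head node
systemctl status slurmd      # on a compute node

# Long, per-node listing in case the summary view is hiding something
sinfo -N -l
```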


slurm.conf


ClusterName=slurmkvasir

SlurmctldHost=kadmin2

MpiDefault=none

ProctrackType=proctrack/cgroup

PrologFlags=contain

ReturnToService=2

SlurmctldPidFile=/var/run/slurm/slurmctld.pid

SlurmctldPort=6817

SlurmdPidFile=/var/run/slurm/slurmd.pid

SlurmdPort=6818

SlurmdSpoolDir=/var/spool/slurmd

SlurmUser=slurm

StateSaveLocation=/var/spool/slurmctld

TaskPlugin=task/cgroup

MinJobAge=600

SchedulerType=sched/backfill

SelectType=select/cons_tres

PriorityType=priority/multifactor

AccountingStorageHost=localhost

AccountingStoragePass=/var/run/munge/munge.socket.2

AccountingStorageType=accounting_storage/slurmdbd

AccountingStorageTRES=gres/gpu,cpu,node

JobCompType=jobcomp/none

JobAcctGatherFrequency=30

JobAcctGatherType=jobacct_gather/cgroup

SlurmctldDebug=info

SlurmctldLogFile=/var/log/slurm/slurmctld.log

SlurmdDebug=info

SlurmdLogFile=/var/log/slurm/slurmd.log

NodeName=k[001-448]

PartitionName=default Nodes=k[001-448] Default=YES MaxTime=INFINITE State=UP
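Based on the hardware line slurmd logs below (2 sockets, 20 cores per socket, 1 thread per core, 192030 MB memory), I'm guessing the node definition may need the topology spelled out rather than the bare default. Something like this is what I'd try next; the values are lifted straight from my slurmd.log, so treat it as a guess rather than a verified fix:

```
NodeName=k[001-448] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=192030
```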


slurmctld.log


Error: Configured MailProg is invalid

Slurmctld version 24.05.3 started on cluster slurmkvasir

Accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 8617

Error: read_slurm_conf: default partition not set.

Recovered state of 448 nodes

Down nodes: k[002-448]

Recovered information about 0 jobs

Recovered state of 0 reservations

Read_slurm_conf: backup_controller not specified

Select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure

Running as primary controller


slurmd.log


Error: Node configuration differs from hardware: CPUs=1:40(hw) Boards=1:1(hw) SocketsPerBoard=1:2(hw) CoresPerSocket=1:20(hw) ThreadsPerCore=1:1(hw)

CPU frequency setting not configured for this node

Slurmd version 24.05.3 started

Slurmd started on Wed, 27 Nov 2024 06:51:03 -0700

CPUS=1 Boards=1 Cores=1 Threads=1 Memory=192030 TmpDisk=95201 uptime 166740 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

Error: _forward_thread: failed to k019 (10.142.0.119:6818): Connection timed out

(Above line repeated 20 or so times for different nodes.)
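For the connection timeouts, the other things I can think to check are whether slurmd is actually listening on its port and whether a firewall is blocking inter-node traffic, plus slurmd -C, which I understand prints a NodeName line matching the hardware it detects. Happy to post output from any of these too:

```shell
# On a compute node: print the NodeName line slurmd derives from the hardware
slurmd -C

# Confirm slurmd is listening on its port (SlurmdPort=6818 in slurm.conf)
ss -lntp | grep 6818

# On firewalld systems, check whether 6817/6818 are open between nodes
firewall-cmd --list-all
```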


Thanks,

Kent Hanson