If you’re sure you’ve restarted everything after the config change, are you also sure that those nodes and partitions aren’t hidden from your current user? You can try `sinfo -a` to rule that out, or run it as root.
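A quick sketch of that check (assuming a standard Slurm CLI on the head node; `-a`/`--all` includes partitions hidden from the current user):

```
# Show all partitions and nodes, including hidden ones
sinfo -a

# Query the controller directly for its node records
scontrol show nodes
```

If `sinfo -a` as root shows the nodes but plain `sinfo` does not, the issue is visibility/permissions rather than registration.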
--
#BlackLivesMatter
 ____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novosirj@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
     `'
On Nov 27, 2024, at 09:56, Kent L. Hanson via slurm-users slurm-users@lists.schedmd.com wrote:
I am doing a new install of Slurm 24.05.3. I have all the packages built and installed on the head node and compute nodes with the same munge.key, slurm.conf, and gres.conf files. I was able to run the munge and unmunge commands to test munge successfully. Time is synced with chronyd. I can’t seem to find any useful errors in the logs. For some reason, when I run sinfo no nodes are listed; I just see the headers for each column. Has anyone seen this, or know what a next step of troubleshooting would be? I’m new to this and not sure where to go from here. Thanks for any and all help!
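For reference, the usual cross-node munge check (per the MUNGE documentation; `k001` here is a placeholder for one of the compute nodes) looks like:

```
# Encode a credential locally and decode it on a compute node.
# Both hosts must share the same munge.key and have synced clocks.
munge -n | ssh k001 unmunge
```

If that succeeds in both directions, munge itself can usually be ruled out.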
The odd output I am seeing:

[username@headnode ~]$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST

(Nothing is output showing the status of partitions or nodes.)
slurm.conf:

ClusterName=slurmkvasir
SlurmctldHost=kadmin2
MpiDefault=none
ProctrackType=proctrack/cgroup
PrologFlags=contain
ReturnToService=2
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/cgroup
MinJobAge=600
SchedulerType=sched/backfill
SelectType=select/cons_tres
PriorityType=priority/multifactor
AccountingStorageHost=localhost
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTRES=gres/gpu,cpu,node
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmLogFile=/var/log/slurm/slurmd.log
NodeName=k[001-448]
PartitionName=default Nodes=k[001-448] Default=YES MaxTime=INFINITE State=up
slurmctld.log:

error: Configured MailProg is invalid
slurmctld version 24.05.3 started on cluster slurmkvasir
accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 8617
error: read_slurm_conf: default partition not set.
Recovered state of 448 nodes
Down nodes: k[002-448]
Recovered information about 0 jobs
Recovered state of 0 reservations
read_slurm_conf: backup_controller not specified
select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
Running as primary controller
slurmd.log:

error: Node configuration differs from hardware: CPUs=1:40(hw) Boards=1:1(hw) SocketsPerBoard=1:2(hw) CoresPerSocket=1:20(hw) ThreadsPerCore=1:1(hw)
CPU frequency setting not configured for this node
slurmd version 24.05.3 started
slurmd started on Wed, 27 Nov 2024 06:51:03 -0700
CPUs=1 Boards=1 Cores=1 Threads=1 Memory=192030 TmpDisk=95201 Uptime=166740 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
error: _forward_thread: failed to k019 (10.142.0.119:6818): Connection timed out

(The line above is repeated 20 or so times for different nodes.)
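Given the “Node configuration differs from hardware” error above, one likely next step is to declare the actual topology on the NodeName line, since the bare NodeName entry defaults everything to 1. A sketch matching the hw values slurmd reported (this is an assumption based on the log, not a confirmed fix; RealMemory is set slightly below the reported 192030 MB to leave headroom):

```
NodeName=k[001-448] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=192000 State=UNKNOWN
PartitionName=default Nodes=k[001-448] Default=YES MaxTime=INFINITE State=UP
```

After editing slurm.conf on all nodes, `scontrol reconfigure` (or restarting slurmctld and slurmd) would pick up the change.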
Thanks,
Kent Hanson
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com