Hi Kent,
On your management node, could you run: systemctl status slurmctld
and check the 'NodeName=...' and 'PartitionName=...' lines in /etc/slurm.conf? My slurm.conf has a more detailed node description, and the NodeName keyword starts with an upper-case letter (I don't know whether slurm.conf is case sensitive):
NodeName=kareline-0-[0-3] Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=47900
and it looks like Slurm does not understand your node description.
Patrick
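
A more detailed NodeName line like the one above can be derived from what slurmd itself detected. The slurmd.log excerpt in Kent's original message (further down) lists configured versus detected values as "name=configured:detected(hw)". Here is a minimal Python sketch that turns that line into a candidate NodeName entry. The helper names are hypothetical, and it assumes one board per node so that Sockets equals SocketsPerBoard. RealMemory is left out unless you pass a value you have checked yourself.

```python
import re

# Matches fields such as "CoresPerSocket=1:20(hw)":
# name, configured value, then the value slurmd detected on the hardware.
# The separator after the name may be "=" or ":" (the quoted log shows both).
FIELD = re.compile(r"(\w+)[=:](\d+):(\d+)\(hw\)")


def hardware_values(log_line):
    """Return {field_name: detected_value} from a 'differs from hardware' line."""
    return {name: int(hw) for name, _configured, hw in FIELD.findall(log_line)}


def node_line(nodes, hw, real_memory=None):
    """Build a candidate NodeName line (hypothetical helper, assumes Boards=1)."""
    parts = [
        f"NodeName={nodes}",
        f"Sockets={hw['SocketsPerBoard']}",
        f"CoresPerSocket={hw['CoresPerSocket']}",
        f"ThreadsPerCore={hw['ThreadsPerCore']}",
    ]
    if real_memory is not None:
        parts.append(f"RealMemory={real_memory}")
    return " ".join(parts)


# Line copied from the slurmd.log excerpt quoted in this thread.
line = (
    "Node configuration differs from hardware: CPUS=1:40(hw) Boards=1:1(hw) "
    "SocketsPerBoard=1:2(hw) CoresPerSocket=1:20(hw) ThreadsPerCore:1:1(hw)"
)
print(node_line("k[001-448]", hardware_values(line)))
```

Running "slurmd -C" on a compute node should print the detected hardware in slurm.conf syntax as well, which may be simpler than parsing the log.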
On 27/11/2024 at 17:46, Ryan Novosielski via slurm-users wrote:
At this point, I’d probably crank up the logging some and see what it’s saying in slurmctld.log.
--
#BlackLivesMatter
Ryan Novosielski - novosirj@rutgers.edu
Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
Office of Advanced Research Computing - MSB A555B, Newark
Rutgers, the State University of NJ
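
Once the log level is raised (the posted slurm.conf has SlurmctldDebug=info; a level such as debug should produce more output), the log gets long, so it can help to pull out only the error lines first. A minimal Python sketch, assuming the log is available as text; the function name is hypothetical and the sample lines are taken from the slurmctld.log excerpt quoted in this thread:

```python
import re

# Match "error:" anywhere in the line, so that lines with or without a
# leading timestamp are both caught.
ERROR = re.compile(r"\berror:", re.IGNORECASE)


def error_lines(log_text):
    """Return the lines of a Slurm log that are flagged as errors."""
    return [ln.strip() for ln in log_text.splitlines() if ERROR.search(ln)]


# Sample lines from the slurmctld.log excerpt in the original message.
sample = """\
Error: Configured MailProg is invalid
Slurmctld version 24.05.3 started on cluster slurmkvasir
Error: read_slurm_conf: default partition not set.
Recovered state of 448 nodes
"""

for ln in error_lines(sample):
    print(ln)
```

For a real log, read the file named by SlurmctldLogFile (here /var/log/slurm/slurmctld.log) and pass its contents to error_lines.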
On Nov 27, 2024, at 11:38, Kent L. Hanson <Kent.Hanson@inl.gov> wrote:
Hey Ryan,

I have restarted the slurmctld and slurmd services several times. I hashed the slurm.conf files; they are the same. I ran “sinfo -a” as root with the same result.

Thanks,
Kent

From: Ryan Novosielski <novosirj@rutgers.edu>
Sent: Wednesday, November 27, 2024 9:31 AM
To: Kent L. Hanson <Kent.Hanson@inl.gov>
Cc: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] sinfo not listing any partitions

If you’re sure you’ve restarted everything after the config change, are you also sure that you don’t have that stuff hidden from your current user? You can try -a to rule that out. Or run as root.

On Nov 27, 2024, at 09:56, Kent L. Hanson via slurm-users <slurm-users@lists.schedmd.com> wrote:

I am doing a new install of Slurm 24.05.3. I have all the packages built and installed on the headnode and compute node, with the same munge.key, slurm.conf, and gres.conf files. I was able to run the munge and unmunge commands to test munge successfully. Time is synced with chronyd. I can’t seem to find any useful errors in the logs. For some reason, when I run sinfo no nodes are listed; I just see the headers for each column. Has anyone seen this, or does anyone know what a next troubleshooting step would be? I’m new to this and not sure where to go from here. Thanks for any and all help!

The odd output I am seeing:

[username@headnode ~] sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
(Nothing is output showing status of partition or nodes)

slurm.conf:

ClusterName=slurmkvasir
SlurmctldHost=kadmin2
MpiDefault=none
ProctrackType=proctrack/cgroup
PrologFlags=contain
ReturnToService=2
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/cgroup
MinJobAge=600
SchedulerType=sched/backfill
SelectType=select/cons_tres
PriorityType=priority/multifactor
AccountingStorageHost=localhost
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTRES=gres/gpu,cpu,node
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmLogFile=/var/log/slurm/slurmd.log
nodeName=k[001-448]
PartitionName=default Nodes=k[001-448] Default=YES MaxTime=INFINITE State=up

slurmctld.log:

Error: Configured MailProg is invalid
Slurmctld version 24.05.3 started on cluster slurmkvasir
Accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 8617
Error: read_slurm_conf: default partition not set.
Recovered state of 448 nodes
Down nodes: k[002-448]
Recovered information about 0 jobs
Recovered state of 0 reservations
Read_slurm_conf: backup_controller not specified
Select/cons_tres; select_p_reconfigure: select/cons_tres: reconfigure
Running as primary controller

slurmd.log:

Error: Node configuration differs from hardware: CPUS=1:40(hw) Boards=1:1(hw) SocketsPerBoard=1:2(hw) CoresPerSocket=1:20(hw) ThreadsPerCore:1:1(hw)
CPU frequency setting not configured for this node
Slurmd version 24.05.3 started
Slurmd started on Wed, 27 Nov 2024 06:51:03 -0700
CPUS=1 Boards=1 Cores=1 Threads=1 Memory=192030 TmpDisk=95201 uptime 166740 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
Error: _forward_thread: failed to k019 (10.142.0.119:6818): Connection timed out
(Above line repeated 20 or so times for different nodes.)

Thanks,
Kent Hanson

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com