Hi Everyone,
I'm new to Slurm administration and looking for a bit of help!
I've just added accounting to an existing cluster, but job information is not making it into the accounting MariaDB. When I submit a test job it gets scheduled fine and is visible with squeue, but I get nothing back from sacct!
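For reference, this is roughly the test sequence (the sleep job is just an illustration, not my actual workload):

$ sbatch --wrap="sleep 60"    # job is accepted and scheduled
$ squeue                      # job shows up here as expected
$ sacct -j <jobid>            # returns nothing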
I have turned the logging up to debug5 on both slurmctld and slurmdbd and can't see any errors. I believe the comms between slurmctld and slurmdbd are OK: when I run the sacct command I can see the database being queried, but it returns nothing, because nothing has been added to the tables. The cluster tables were created fine when I ran:
# sacctmgr add cluster ny5ktt
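As a sanity check, my understanding is that a successful registration of slurmctld with slurmdbd should show up as populated ControlHost/ControlPort columns for the cluster, e.g.:

# sacctmgr show cluster format=Cluster,ControlHost,ControlPort,RPC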
And this is everything sacct returns for the test job:

$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
# tail -f slurmdbd.log
[2024-10-17T12:34:45.232] debug:  REQUEST_PERSIST_INIT: CLUSTER:ny5ktt VERSION:9216 UID:10001 IP:10.202.233.117 CONN:10
[2024-10-17T12:34:45.232] debug2: accounting_storage/as_mysql: acct_storage_p_get_connection: acct_storage_p_get_connection: request new connection 1
[2024-10-17T12:34:45.233] debug2: Attempting to connect to localhost:3306
[2024-10-17T12:34:45.274] debug2: DBD_GET_JOBS_COND: called
[2024-10-17T12:34:45.317] debug2: DBD_FINI: CLOSE:1 COMMIT:0
[2024-10-17T12:34:45.317] debug4: accounting_storage/as_mysql: acct_storage_p_commit: got 0 commits
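To watch the controller side of the DBD traffic as well, my understanding is that protocol tracing can be toggled at runtime rather than editing DebugFlags in slurm.conf:

# scontrol setdebugflags +Protocol
# tail -f /var/log/slurm/slurmctld.log | grep -i dbd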
MariaDB is running on its own node with slurmdbd and munged for authentication. I haven't set up any accounts, users, associations or enforcement yet; on my lab cluster, jobs were visible in the database without these being set up. I guess I must be missing something simple in the config that is stopping jobs from being reported to slurmdbd.
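One thing I'm not sure about: I understand that changing AccountingStorageType to accounting_storage/slurmdbd needs a full restart of slurmctld (scontrol reconfigure is not enough for this option), so checking what the running daemon has actually loaded would look something like:

# scontrol show config | grep -i AccountingStorage
# systemctl restart slurmctld    # if the live values don't match slurm.conf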
Master node packages:

# rpm -qa | grep slurm
slurm-slurmdbd-20.11.9-1.el8.x86_64
slurm-libs-20.11.9-1.el8.x86_64
slurm-20.11.9-1.el8.x86_64
slurm-slurmd-20.11.9-1.el8.x86_64
slurm-perlapi-20.11.9-1.el8.x86_64
slurm-doc-20.11.9-1.el8.x86_64
slurm-contribs-20.11.9-1.el8.x86_64
slurm-slurmctld-20.11.9-1.el8.x86_64
Database node packages:

# rpm -qa | grep slurm
slurm-slurmdbd-20.11.9-1.el8.x86_64
slurm-20.11.9-1.el8.x86_64
slurm-libs-20.11.9-1.el8.x86_64
slurm-devel-20.11.9-1.el8.x86_64
slurm.conf:

#
# See the slurm.conf man page for more information.
#
ClusterName=ny5ktt
ControlMachine=ny5-pr-kttslurm-01
ControlAddr=10.202.233.71
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
MailProg=/bin/true
MaxJobCount=200000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
#MinJobAge=300
#MinJobAge=43200 # CHG0057915
MinJobAge=14400 # CHG0057915
#MaxJobCount=50000
#MaxJobCount=100000
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
DefMemPerCPU=3000
#FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#SelectTypeParameters=CR_Core
#SelectTypeParameters=CR_CPU
SelectTypeParameters=CR_CPU_Memory # ECR CHG0056915 10/14/2023
MaxArraySize=5001
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageEnforce=limits
AccountingStorageHost=ny5-pr-kttslurmdb-01.ktt.schonfeld.com
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
#AccountingStorageType=accounting_storage/none
AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageUser=
AccountingStoreJobComment=YES
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=60
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
##using fqdn since the ctld domain is different. Can't use regex since it's not at the end
##save 17 and 18 as headnodes
#NodeName=ny5-dv-kttres-17 Sockets=1 CoresPerSocket=18 ThreadsPerCore=2 Feature=HyperThread RealMemory=102400
#NodeName=ny5-dv-kttres-18 Sockets=1 CoresPerSocket=14 ThreadsPerCore=2 Feature=HyperThread RealMemory=102400
NodeName=ny5-dv-kttres-19 Sockets=1 CoresPerSocket=12 ThreadsPerCore=2 Feature=HyperThread RealMemory=102400
NodeName=ny5-dv-kttres-[20-21] Sockets=1 CoresPerSocket=18 ThreadsPerCore=2 Feature=HyperThread RealMemory=102400
NodeName=ny5-dv-kttres-[01-16] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Feature=HyperThread RealMemory=233472
NodeName=ny5-dv-kttres-[22-35] Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 Feature=HyperThread RealMemory=346884
PartitionName=ktt_slurm_light_1 Nodes=ny5-dv-kttres-[19-21] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_medium_1 Nodes=ny5-dv-kttres-[01-08] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_medium_2 Nodes=ny5-dv-kttres-[09-16] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_medium_3 Nodes=ny5-dv-kttres-[22-28] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_medium_4 Nodes=ny5-dv-kttres-[29-35] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_large_1 Nodes=ny5-dv-kttres-[01-16] Default=YES MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_large_2 Nodes=ny5-dv-kttres-[22-35] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
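For what it's worth, the accounting-relevant lines above boil down to this minimal set. As far as I understand, JobAcctGatherType=jobacct_gather/none only means no per-job resource usage is collected, not that job records are skipped; and since AccountingStoragePort is unset, the default of 6819 should apply:

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=ny5-pr-kttslurmdb-01.ktt.schonfeld.com
#AccountingStoragePort=    # unset, so the default 6819 applies
JobAcctGatherType=jobacct_gather/none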
slurmdbd.conf:

AuthType=auth/munge
DbdAddr=10.202.233.72
DbdHost=ny5-pr-kttslurmdb-01
DebugLevel=debug5
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/tmp/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
#StorageHost=10.234.132.57
StorageUser=slurm
SlurmUser=slurm
StoragePass=xxxxxxx
#StorageUser=slurm
#StorageLoc=slurm_acct_db
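Since slurmdbd connects to MariaDB as the slurm user on localhost, I've been assuming the database grants are in order; a check along these lines (user and host as in my config) should confirm it:

$ mysql -u slurm -p -e "SHOW GRANTS FOR 'slurm'@'localhost';"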
Database tables:

MariaDB [slurm_acct_db]> show tables;
+--------------------------------+
| Tables_in_slurm_acct_db        |
+--------------------------------+
| acct_coord_table               |
| acct_table                     |
| clus_res_table                 |
| cluster_table                  |
| convert_version_table          |
| federation_table               |
| ny5ktt_assoc_table             |
| ny5ktt_assoc_usage_day_table   |
| ny5ktt_assoc_usage_hour_table  |
| ny5ktt_assoc_usage_month_table |
| ny5ktt_event_table             |
| ny5ktt_job_table               |
| ny5ktt_last_ran_table          |
| ny5ktt_resv_table              |
| ny5ktt_step_table              |
| ny5ktt_suspend_table           |
| ny5ktt_usage_day_table         |
| ny5ktt_usage_hour_table        |
| ny5ktt_usage_month_table       |
| ny5ktt_wckey_table             |
| ny5ktt_wckey_usage_day_table   |
| ny5ktt_wckey_usage_hour_table  |
| ny5ktt_wckey_usage_month_table |
| qos_table                      |
| res_table                      |
| table_defs_table               |
| tres_table                     |
| txn_table                      |
| user_table                     |
+--------------------------------+
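Consistent with the empty sacct output, querying the job table directly comes back with zero rows (query shown for completeness):

MariaDB [slurm_acct_db]> select count(*) from ny5ktt_job_table;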
Many Thanks
Adrian