Hi Everyone,
I’m new to Slurm administration and looking for a bit of help!
I’ve just added accounting to an existing cluster, but job information is not being added to the accounting MariaDB. When I submit a test job it gets scheduled fine and is visible with squeue, but I get nothing back from sacct.
I have turned the logging up to debug5 for both slurmctld and slurmdbd and can’t see any errors. I believe the comms between slurmctld and slurmdbd are OK: when I run sacct I can see the database being queried, but it returns nothing, because nothing has been added to the tables. The cluster tables were created fine when I ran:
# sacctmgr add cluster ny5ktt
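(If it helps: my understanding from the sacctmgr man page is that, once slurmctld has registered with slurmdbd, something like the command below should show a non-blank ControlHost for ny5ktt; I can post that output too.)

# sacctmgr show cluster format=Cluster,ControlHost,ControlPort,RPC

Here is what sacct returns after submitting a test job: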
$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
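I don’t think this is just sacct’s default filtering (current user only, jobs started since midnight); a wider query like the one below also comes back empty, which matches the empty job table shown further down:

$ sacct --allusers --starttime=2024-10-01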
# tail -f slurmdbd.log
[2024-10-17T12:34:45.232] debug: REQUEST_PERSIST_INIT: CLUSTER:ny5ktt VERSION:9216 UID:10001 IP:10.202.233.117 CONN:10
[2024-10-17T12:34:45.232] debug2: accounting_storage/as_mysql: acct_storage_p_get_connection: acct_storage_p_get_connection: request new connection 1
[2024-10-17T12:34:45.233] debug2: Attempting to connect to localhost:3306
[2024-10-17T12:34:45.274] debug2: DBD_GET_JOBS_COND: called
[2024-10-17T12:34:45.317] debug2: DBD_FINI: CLOSE:1 COMMIT:0
[2024-10-17T12:34:45.317] debug4: accounting_storage/as_mysql: acct_storage_p_commit: got 0 commits
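If I’m reading the protocol right, job records should arrive at slurmdbd as DBD_JOB_START / DBD_JOB_COMPLETE messages, and I never see any of those in the log at debug5, only the query-side traffic above; a grep like this finds nothing:

# grep -E 'DBD_JOB_START|DBD_JOB_COMPLETE|DBD_STEP_START' /var/log/slurm/slurmdbd.log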
MariaDB is running on its own node alongside slurmdbd, with munge for authentication. I haven’t set up any accounts, users, associations or enforcement yet; on my lab cluster, jobs were visible in the database without these
being set up. I suspect I’m missing something simple in the config that is stopping jobs from being reported to slurmdbd.
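To check that the running slurmctld has actually picked up the accounting settings (rather than just the file on disk), I believe scontrol show config reflects the in-memory configuration:

$ scontrol show config | grep -i AccountingStorage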
Master Node packages
# rpm -qa |grep slurm
slurm-slurmdbd-20.11.9-1.el8.x86_64
slurm-libs-20.11.9-1.el8.x86_64
slurm-20.11.9-1.el8.x86_64
slurm-slurmd-20.11.9-1.el8.x86_64
slurm-perlapi-20.11.9-1.el8.x86_64
slurm-doc-20.11.9-1.el8.x86_64
slurm-contribs-20.11.9-1.el8.x86_64
slurm-slurmctld-20.11.9-1.el8.x86_64
Database Node packages
# rpm -qa |grep slurm
slurm-slurmdbd-20.11.9-1.el8.x86_64
slurm-20.11.9-1.el8.x86_64
slurm-libs-20.11.9-1.el8.x86_64
slurm-devel-20.11.9-1.el8.x86_64
slurm.conf
#
# See the slurm.conf man page for more information.
#
ClusterName=ny5ktt
ControlMachine=ny5-pr-kttslurm-01
ControlAddr=10.202.233.71
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
MailProg=/bin/true
MaxJobCount=200000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
#MinJobAge=300
#MinJobAge=43200
# CHG0057915
MinJobAge=14400
# CHG0057915
#MaxJobCount=50000
#MaxJobCount=100000
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
DefMemPerCPU=3000
#FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#SelectTypeParameters=CR_Core
#SelectTypeParameters=CR_CPU
SelectTypeParameters=CR_CPU_Memory
# ECR CHG0056915 10/14/2023
MaxArraySize=5001
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageEnforce=limits
AccountingStorageHost=ny5-pr-kttslurmdb-01.ktt.schonfeld.com
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
#AccountingStorageType=accounting_storage/none
AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageUser=
AccountingStoreJobComment=YES
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=60
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
##using fqdn since the ctld domain is different. Can't use regex since it's not at the end
##save 17 and 18 as headnodes
#NodeName=ny5-dv-kttres-17 Sockets=1 CoresPerSocket=18 ThreadsPerCore=2 Feature=HyperThread RealMemory=102400
#NodeName=ny5-dv-kttres-18 Sockets=1 CoresPerSocket=14 ThreadsPerCore=2 Feature=HyperThread RealMemory=102400
NodeName=ny5-dv-kttres-19 Sockets=1 CoresPerSocket=12 ThreadsPerCore=2 Feature=HyperThread RealMemory=102400
NodeName=ny5-dv-kttres-[20-21] Sockets=1 CoresPerSocket=18 ThreadsPerCore=2 Feature=HyperThread RealMemory=102400
NodeName=ny5-dv-kttres-[01-16] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Feature=HyperThread RealMemory=233472
NodeName=ny5-dv-kttres-[22-35] Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 Feature=HyperThread RealMemory=346884
PartitionName=ktt_slurm_light_1 Nodes=ny5-dv-kttres-[19-21] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_medium_1 Nodes=ny5-dv-kttres-[01-08] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_medium_2 Nodes=ny5-dv-kttres-[09-16] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_medium_3 Nodes=ny5-dv-kttres-[22-28] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_medium_4 Nodes=ny5-dv-kttres-[29-35] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_large_1 Nodes=ny5-dv-kttres-[01-16] Default=YES MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_large_2 Nodes=ny5-dv-kttres-[22-35] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
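One thing I’m unsure about in the config above: my understanding is that JobAcctGatherType=jobacct_gather/none only disables per-step resource-usage gathering, and that job start/end records should still reach slurmdbd regardless, but please correct me if that’s wrong. Unless someone spots something else, my next step is to restart the daemons in the documented order (slurmdbd first, then slurmctld) in case slurmctld never re-registered after accounting was enabled:

# systemctl restart slurmdbd     (on ny5-pr-kttslurmdb-01)
# systemctl restart slurmctld    (on ny5-pr-kttslurm-01)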
slurmdbd.conf
AuthType=auth/munge
DbdAddr=10.202.233.72
DbdHost=ny5-pr-kttslurmdb-01
DebugLevel=debug5
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/tmp/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
#StorageHost=10.234.132.57
StorageUser=slurm
SlurmUser=slurm
StoragePass=xxxxxxx
#StorageUser=slurm
#StorageLoc=slurm_acct_db
Database tables
MariaDB [slurm_acct_db]> show tables;
+--------------------------------+
| Tables_in_slurm_acct_db |
+--------------------------------+
| acct_coord_table |
| acct_table |
| clus_res_table |
| cluster_table |
| convert_version_table |
| federation_table |
| ny5ktt_assoc_table |
| ny5ktt_assoc_usage_day_table |
| ny5ktt_assoc_usage_hour_table |
| ny5ktt_assoc_usage_month_table |
| ny5ktt_event_table |
| ny5ktt_job_table |
| ny5ktt_last_ran_table |
| ny5ktt_resv_table |
| ny5ktt_step_table |
| ny5ktt_suspend_table |
| ny5ktt_usage_day_table |
| ny5ktt_usage_hour_table |
| ny5ktt_usage_month_table |
| ny5ktt_wckey_table |
| ny5ktt_wckey_usage_day_table |
| ny5ktt_wckey_usage_hour_table |
| ny5ktt_wckey_usage_month_table |
| qos_table |
| res_table |
| table_defs_table |
| tres_table |
| txn_table |
| user_table |
+--------------------------------+
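And the job table itself really is empty; a direct count matches what sacct shows:

MariaDB [slurm_acct_db]> SELECT COUNT(*) FROM ny5ktt_job_table;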
Many Thanks
Adrian