- slurm-users - lists.schedmd.com

Tracking costs - one single pool of credits, variable costs per partition
by John Snowdon 18 Oct '24


We are trying to design the charging and accounting system for our new institutional HPC facility, and I'm having difficulty understanding exactly how we can use sacctmgr to achieve what we need.

Until now, our HPC facilities have all been free to use, and we have not needed to track costs by user/group/project. Account codes have been purely optional.

However, our new facility will be split into various resource types, with free partitions and paid/priority/reserved partitions across those resource types. All jobs will need to be submitted with an account code.

For users submitting to 'free' partitions we don't need to track resource units against a balance, but the submitted account code would still be used for reporting purposes (e.g. "free resources accounted for X% of all use by this project in August-September").

When submitting to a 'paid' partition, the account code needs to be checked to ensure it has a positive balance (or a balance that will not go past some negative threshold).

Each of the 'paid' partitions may (will) have a different resource unit cost. A simple example:

- Submit to a generic CPU paid partition: 1 resource unit/token/credit/£/$ per allocated CPU, per hour of compute
- Submit to a high-speed, non-blocking CPU paid partition: 2 resource units/tokens/credits/£/$ per allocated CPU, per hour of compute
- Submit to a GPU paid partition: 4 resource units/tokens/credits/£/$ per allocated GPU card, per hour of compute

We need to have *one* pool of resource units/tokens/credits per account (let's say 1000 credits), and a group of users may well decide to spend all of their credits on the generic CPU partition, all on the GPU partition, or some mixture of the two.

So in the above examples, assuming one user (or a group of users sharing the same account code) submits a 2-hour job using a single CPU or GPU to each of the three partitions, their one, single account code should be charged:

- 2 units for the job on the generic CPU partition
- 4 units for the job on the low-latency partition
- 8 units for the job on the GPU partition
- A total of 14 credits removed from their single account code

Is this feasible to achieve without having to allocate credits to each of the partitions for an account, or creating a QOS variant for each and every combination of account and partition?

John Snowdon
Senior Research Infrastructure Engineer (HPC)
Research Software Engineering
Catalyst Building, Room 2.01
Newcastle University
3 Science Square
Newcastle Helix
Newcastle upon Tyne
NE4 5TG
https://rse.ncldata.dev/
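The charging arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the sums in the post, not of how Slurm itself accounts for usage; the partition names, the `RATES` table and the `job_cost` helper are hypothetical, and each job is assumed to use a single CPU or GPU.

```python
# Hypothetical per-partition rates from the example in the post:
# credits per allocated unit (CPU or GPU card) per hour of compute.
RATES = {
    "cpu_generic": 1,
    "cpu_lowlatency": 2,
    "gpu": 4,
}

STARTING_CREDITS = 1000  # the single shared pool mentioned in the post


def job_cost(partition: str, units: int, hours: float) -> float:
    """Return the credits charged for one job (rate x units x hours)."""
    return RATES[partition] * units * hours


# One 2-hour job per partition, each assumed to use one CPU or one GPU.
jobs = [("cpu_generic", 1, 2), ("cpu_lowlatency", 1, 2), ("gpu", 1, 2)]

charges = [job_cost(partition, units, hours) for partition, units, hours in jobs]
total = sum(charges)
balance = STARTING_CREDITS - total

print(charges, total, balance)
```

All three charges are drawn from the same balance, which is the behaviour the post is asking for.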


How do you guys track which GPU is used by which job?
by Sylvain MARET 17 Oct '24


Hey guys!

I'm looking to improve GPU monitoring on our cluster. I want to install https://github.com/NVIDIA/dcgm-exporter and saw in the README that it can support tracking of job IDs: https://github.com/NVIDIA/dcgm-exporter?tab=readme-ov-file#enabling-hpc-job…

However, I haven't been able to find any examples of how to do it, nor does Slurm seem to expose this information by default.

Does anyone here do this? If so, do you have any examples I could try to follow? If you have advice on best practices for monitoring GPUs, I'd be happy to hear it!

Regards,
Sylvain Maret
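One way such a job mapping could be produced is from a Slurm prolog that records the job ID against each allocated GPU. The sketch below is a rough illustration under several assumptions that should be checked against the dcgm-exporter README: that the exporter reads a directory containing one file per GPU index with job IDs one per line, and that the prolog receives the job ID and a comma-separated GPU list (for example from `SLURM_JOB_ID` and `SLURM_JOB_GPUS`). The helper names `write_job_mapping` and `clear_job_mapping` are hypothetical.

```python
from pathlib import Path


def write_job_mapping(mapping_dir, job_id, gpu_ids):
    """Write job_id into one file per GPU index; return the paths written.

    gpu_ids is a comma-separated string such as "0,1" (assumed format).
    """
    directory = Path(mapping_dir)
    directory.mkdir(parents=True, exist_ok=True)
    written = []
    for gpu in gpu_ids.split(","):
        gpu = gpu.strip()
        if not gpu:
            continue
        path = directory / gpu
        path.write_text(f"{job_id}\n")
        written.append(path)
    return written


def clear_job_mapping(mapping_dir, gpu_ids):
    """Remove the per-GPU files again, e.g. from an epilog."""
    for gpu in gpu_ids.split(","):
        gpu = gpu.strip()
        if gpu:
            (Path(mapping_dir) / gpu).unlink(missing_ok=True)
```

A prolog would call `write_job_mapping` with the directory the exporter is configured to read, and the matching epilog would call `clear_job_mapping`. This simple version assumes GPUs are not shared between jobs.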


Job information is not being added to accounting database on new setup
by Adrian Brady 17 Oct '24


Hi Everyone,

I'm new to Slurm administration and looking for a bit of help!

I have just added accounting to an existing cluster, but job information is not being added to the accounting MariaDB. When I submit a test job it gets scheduled fine and it's visible with squeue, but I get nothing returned from sacct!

I have turned up the logging to debug5 on both slurmctld and slurmdbd and can't see any errors in the logs. I believe all the comms are OK between slurmctld and slurmdbd: when I enter the sacct command I can see the database being queried, but it returns nothing because nothing has been added to the tables. The cluster tables were created fine when I ran:

# sacctmgr add cluster ny5ktt

$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

# tail -f slurmdbd.log
[2024-10-17T12:34:45.232] debug: REQUEST_PERSIST_INIT: CLUSTER:ny5ktt VERSION:9216 UID:10001 IP:10.202.233.117 CONN:10
[2024-10-17T12:34:45.232] debug2: accounting_storage/as_mysql: acct_storage_p_get_connection: acct_storage_p_get_connection: request new connection 1
[2024-10-17T12:34:45.233] debug2: Attempting to connect to localhost:3306
[2024-10-17T12:34:45.274] debug2: DBD_GET_JOBS_COND: called
[2024-10-17T12:34:45.317] debug2: DBD_FINI: CLOSE:1 COMMIT:0
[2024-10-17T12:34:45.317] debug4: accounting_storage/as_mysql: acct_storage_p_commit: got 0 commits

MariaDB is running on its own node with slurmdbd, with munged for authentication.

I haven't set up any accounts, users, associations or enforcement yet. On my lab cluster, jobs were visible in the database without these being set up.

I guess I must be missing something simple in the config that is stopping jobs being reported to slurmdbd.
Master Node packages

# rpm -qa |grep slurm
slurm-slurmdbd-20.11.9-1.el8.x86_64
slurm-libs-20.11.9-1.el8.x86_64
slurm-20.11.9-1.el8.x86_64
slurm-slurmd-20.11.9-1.el8.x86_64
slurm-perlapi-20.11.9-1.el8.x86_64
slurm-doc-20.11.9-1.el8.x86_64
slurm-contribs-20.11.9-1.el8.x86_64
slurm-slurmctld-20.11.9-1.el8.x86_64

Database Node packages

# rpm -qa |grep slurm
slurm-slurmdbd-20.11.9-1.el8.x86_64
slurm-20.11.9-1.el8.x86_64
slurm-libs-20.11.9-1.el8.x86_64
slurm-devel-20.11.9-1.el8.x86_64

slurm.conf

#
# See the slurm.conf man page for more information.
#
ClusterName=ny5ktt
ControlMachine=ny5-pr-kttslurm-01
ControlAddr=10.202.233.71
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
MailProg=/bin/true
MaxJobCount=200000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
#MinJobAge=300
#MinJobAge=43200 # CHG0057915
MinJobAge=14400 # CHG0057915
#MaxJobCount=50000
#MaxJobCount=100000
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
DefMemPerCPU=3000
#FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#SelectTypeParameters=CR_Core
#SelectTypeParameters=CR_CPU
SelectTypeParameters=CR_CPU_Memory
# ECR CHG0056915 10/14/2023
MaxArraySize=5001
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageEnforce=limits
AccountingStorageHost=ny5-pr-kttslurmdb-01.ktt.schonfeld.com
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
#AccountingStorageType=accounting_storage/none
AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageUser=
AccountingStoreJobComment=YES
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=60
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
##using fqdn since the ctld domain is different. Can't use regex since it's not at the end
##save 17 and 18 as headnodes
#NodeName=ny5-dv-kttres-17 Sockets=1 CoresPerSocket=18 ThreadsPerCore=2 Feature=HyperThread RealMemory=102400
#NodeName=ny5-dv-kttres-18 Sockets=1 CoresPerSocket=14 ThreadsPerCore=2 Feature=HyperThread RealMemory=102400
NodeName=ny5-dv-kttres-19 Sockets=1 CoresPerSocket=12 ThreadsPerCore=2 Feature=HyperThread RealMemory=102400
NodeName=ny5-dv-kttres-[20-21] Sockets=1 CoresPerSocket=18 ThreadsPerCore=2 Feature=HyperThread RealMemory=102400
NodeName=ny5-dv-kttres-[01-16] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Feature=HyperThread RealMemory=233472
NodeName=ny5-dv-kttres-[22-35] Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 Feature=HyperThread RealMemory=346884
PartitionName=ktt_slurm_light_1 Nodes=ny5-dv-kttres-[19-21] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_medium_1 Nodes=ny5-dv-kttres-[01-08] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_medium_2 Nodes=ny5-dv-kttres-[09-16] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_medium_3 Nodes=ny5-dv-kttres-[22-28] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_medium_4 Nodes=ny5-dv-kttres-[29-35] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_large_1 Nodes=ny5-dv-kttres-[01-16] Default=YES MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_large_2 Nodes=ny5-dv-kttres-[22-35] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2

Slurmdbd.conf

AuthType=auth/munge
DbdAddr=10.202.233.72
DbdHost=ny5-pr-kttslurmdb-01
DebugLevel=debug5
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/tmp/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
#StorageHost=10.234.132.57
StorageUser=slurm
SlurmUser=slurm
StoragePass=xxxxxxx
#StorageUser=slurm
#StorageLoc=slurm_acct_db

Database tables

MariaDB [slurm_acct_db]> show tables;
+--------------------------------+
| Tables_in_slurm_acct_db        |
+--------------------------------+
| acct_coord_table               |
| acct_table                     |
| clus_res_table                 |
| cluster_table                  |
| convert_version_table          |
| federation_table               |
| ny5ktt_assoc_table             |
| ny5ktt_assoc_usage_day_table   |
| ny5ktt_assoc_usage_hour_table  |
| ny5ktt_assoc_usage_month_table |
| ny5ktt_event_table             |
| ny5ktt_job_table               |
| ny5ktt_last_ran_table          |
| ny5ktt_resv_table              |
| ny5ktt_step_table              |
| ny5ktt_suspend_table           |
| ny5ktt_usage_day_table         |
| ny5ktt_usage_hour_table        |
| ny5ktt_usage_month_table       |
| ny5ktt_wckey_table             |
| ny5ktt_wckey_usage_day_table   |
| ny5ktt_wckey_usage_hour_table  |
| ny5ktt_wckey_usage_month_table |
| qos_table                      |
| res_table                      |
| table_defs_table               |
| tres_table                     |
| txn_table                      |
| user_table                     |
+--------------------------------+

Many Thanks
Adrian
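With a slurm.conf as long as the one in this post, it can help to list only the active (uncommented) accounting-related settings before looking for the cause. The following is a minimal sketch of that filtering step; the `active_settings` helper and its prefix list are hypothetical, the sample text is copied from the post, and the parser assumes simple `Key=Value` lines with `#` comments and case-sensitive key names.

```python
def active_settings(conf_text, prefixes=("Accounting", "JobAcctGather", "JobComp")):
    """Return uncommented Key=Value settings whose key starts with a prefix."""
    settings = {}
    for raw in conf_text.splitlines():
        # Drop comments (whole-line or trailing) and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.startswith(prefixes):
            settings[key] = value.strip()
    return settings


# Lines copied from the LOGGING AND ACCOUNTING section of the posted slurm.conf.
sample = """\
#AccountingStorageEnforce=limits
AccountingStorageHost=ny5-pr-kttslurmdb-01.ktt.schonfeld.com
#AccountingStorageType=accounting_storage/none
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
JobCompType=jobcomp/none
JobAcctGatherFrequency=60
JobAcctGatherType=jobacct_gather/none
"""

found = active_settings(sample)
for key, value in found.items():
    print(f"{key}={value}")
```

The same function can be pointed at the full file contents, for example with `active_settings(open("/etc/slurm/slurm.conf").read())`, where the path is an assumption about the local install.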


Issue with interactive jobs
by Nerjes, Onno 17 Oct '24


Dear all,

we've set up SLURM 24.05.3 on our cluster and are experiencing an issue with interactive jobs. Before, we used 21.08 with pretty much the same settings, but without these issues. We've started with a fresh DB etc.

The behavior of interactive jobs is very erratic. Sometimes they start absolutely fine; at other times they die silently in the background while the user waits indefinitely. We have been unable to isolate particular users or nodes affected by this. On a given node, one user might be able to start an interactive job while another user, at the same time, isn't able to. The day after, the situation might be the other way around.

The exception is jobs that use a reservation. These start fine every time, as far as we can tell. The number of idle nodes does not seem to influence the behavior described above.

Failed allocation on the front end:

[user1@login1 ~]$ salloc
salloc: Pending job allocation 5052052
salloc: job 5052052 queued and waiting for resources

The same job on the backend (newest entry first):

2024-10-14 11:41:57.680 slurmctld: _job_complete: JobId=5052052 done
2024-10-14 11:41:57.678 slurmctld: _job_complete: JobId=5052052 WEXITSTATUS 1
2024-10-14 11:41:57.678 slurmctld: Killing interactive JobId=5052052: Communication connection failure
2024-10-14 11:41:46.666 slurmctld: sched/backfill: _start_job: Started JobId=5052052 in devel on m02n01
2024-10-14 11:41:30.096 slurmctld: sched: _slurm_rpc_allocate_resources JobId=5052052 NodeList=(null) usec=6258

Raising the debug level has not brought additional information. We were hoping that one of you might be able to provide some insight into what the next steps in troubleshooting might be.

Best regards,
Onno
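The slurmctld excerpt in this post is listed newest-first, which makes the sequence of events harder to follow. As a small sketch, the lines can be sorted chronologically and the gap between the job starting and being killed computed; the log text is copied from the post, and the `parse` helper and the fixed 23-character timestamp width are assumptions about this particular log format.

```python
from datetime import datetime

# Log lines copied from the post, in their original (newest-first) order.
LOG = """\
2024-10-14 11:41:57.680 slurmctld: _job_complete: JobId=5052052 done
2024-10-14 11:41:57.678 slurmctld: _job_complete: JobId=5052052 WEXITSTATUS 1
2024-10-14 11:41:57.678 slurmctld: Killing interactive JobId=5052052: Communication connection failure
2024-10-14 11:41:46.666 slurmctld: sched/backfill: _start_job: Started JobId=5052052 in devel on m02n01
2024-10-14 11:41:30.096 slurmctld: sched: _slurm_rpc_allocate_resources JobId=5052052 NodeList=(null) usec=6258
"""


def parse(line):
    """Split a line into (timestamp, message); the stamp is 23 characters."""
    stamp, message = line[:23], line[24:]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f"), message


events = sorted(parse(line) for line in LOG.splitlines() if line.strip())

started = next(t for t, msg in events if "Started JobId" in msg)
killed = next(t for t, msg in events if "Killing interactive" in msg)
elapsed = (killed - started).total_seconds()

for when, message in events:
    print(when.time(), message)
print(f"seconds between start and kill: {elapsed:.3f}")
```

Read in this order, the job is queued, started by the backfill scheduler about 16 seconds later, and killed with a communication failure roughly 11 seconds after it started.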


Dependency jobs
by adam＠bramblecfd.com 17 Oct '24


Hi Slurm Users,

I am trying to figure out whether there is a way to check if a running job has any jobs queued up behind it that depend on it. I know you can show a job's info and find which dependency it is waiting for, but I am after the reverse: given the ID of the currently running job, check whether any jobs are waiting on it to complete.

Many Thanks,
Adam
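One possible approach to the reverse lookup asked about here is to list pending jobs together with their dependency field and filter for the job ID of interest. The sketch below assumes `squeue` output produced with the format string `%i %E` (job ID and remaining dependencies) and dependency strings shaped like `afterok:100(unfulfilled)`; the exact dependency text may differ between Slurm versions, and the helper names `find_dependents` and `query_squeue` are hypothetical.

```python
import re
import subprocess


def find_dependents(squeue_text, job_id):
    """Return IDs of jobs whose dependency field mentions job_id."""
    dependents = []
    for line in squeue_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue
        waiting_job, dependency = parts
        # Job IDs follow a colon, e.g. "afterok:100" or "afterok:100:101".
        if str(job_id) in re.findall(r":(\d+)", dependency):
            dependents.append(waiting_job)
    return dependents


def query_squeue():
    """Run squeue for pending jobs (requires a Slurm client install)."""
    result = subprocess.run(
        ["squeue", "--noheader", "--states=PENDING", "--format=%i %E"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Made-up sample output used only to illustrate the filtering step.
sample = """\
101 afterok:100(unfulfilled)
102 afterany:100(unfulfilled),afterok:101(unfulfilled)
103 (null)
104 afterok:99(unfulfilled)
"""
print(find_dependents(sample, 100))
```

On a live cluster, `find_dependents(query_squeue(), job_id)` would apply the same filter to real output.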


AllowAccounts partition setting
by Marko Markoc 16 Oct '24


Hi All,

At one point, when we were running Slurm 22, we set up a few new partitions with the `AllowAccounts` setting enabled. We limit access to these partitions to only a few accounts. We manually create these accounts and users using the `sacctmgr` command. At that time, everything was working: only users in these accounts were able to run jobs in those partitions.

A few months ago we upgraded Slurm to 23.02.7, but just a couple of days ago we noticed that anyone could submit jobs to these partitions. In addition, you can provide any value to the `--account` option and Slurm will accept it and record it in the job accounting database. In our initial testing, you could only provide an `--account` value that had previously been created.

I'm still looking into this, but I wanted to check whether there was any change in version 23 that could have caused this. I don't remember seeing anything in the release notes.

Thanks,
Marko


Problem with nodes with 1 gpu
by Jörg Striewski 16 Oct '24


I cannot send jobs to nodes with one GPU, and I can't find the bug in my configuration. Can someone help me?

In slurm.conf, GresTypes=gpu is set.

These are some of the nodes in slurm.conf:

NodeName=gpu-[001-003] CPUs=8 SocketsPerBoard=1 CoresPerSocket=4 RealMemory=31000 Gres=gpu:1080:1
NodeName=gpu-[010-019] CPUs=16 SocketsPerBoard=1 CoresPerSocket=8 RealMemory=64000 Gres=gpu:1080:2

The partition for these GPU nodes is:

# General GPU partitions
PartitionName=GPU Nodes=gpu-[001-003,010-019] AllowAccounts=staff PreemptMode=REQUEUE PriorityTier=0 DefMemPerGPU=32000 DefCpuPerGPU=8 CpuBind=none TRESBillingWeights="GRES/gpu=1000" GraceTime=300

These are the entries for some of the nodes in gres.conf:

NodeName=gpu-[001-003] Name=gpu Type=1080 File=/dev/nvidia0
NodeName=gpu-[010-019] Name=gpu Type=1080 File=/dev/nvidia[0-1]

When I send a job with sbatch to gpu-001:

#SBATCH --job-name=hello
#SBATCH --ntasks-per-node=1
#SBATCH --output=hello_%A.out
#SBATCH --time=00:10:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=striewski(a)ismll.de
#SBATCH --partition=GPU
#SBATCH --nodelist=gpu-001
#SBATCH --gres=gpu:1
[...]

I get the error:

sbatch: error: Batch job submission failed: Requested node configuration is not available

When I send the job to a node with 2 GPUs, just by setting --nodelist=gpu-12, it runs with no error.

Does someone have a hint as to what I did wrong?

Kind regards
--
Jörg Striewski
Information Systems and Machine Learning Lab (ISMLL)
Institute of Computer Science
University of Hildesheim, Germany
Post address: Universitätsplatz 1, D-31141 Hildesheim, Germany
Visitor address: Samelsonplatz 1, D-31141 Hildesheim, Germany
Tel. (+49) 05121 / 883-40392
http://www.ismll.uni-hildesheim.de
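One thing worth checking in the configuration quoted here is whether the partition's per-GPU defaults fit on each node type. The sketch below only does that arithmetic with the values from the post (DefMemPerGPU=32000, DefCpuPerGPU=8, and the nodes' CPUs and RealMemory); the `default_request_problems` helper is hypothetical, and whether this mismatch is what actually triggers the "Requested node configuration is not available" error is an assumption that the thread does not confirm.

```python
# Node definitions and partition defaults copied from the post.
NODES = {
    "gpu-[001-003]": {"cpus": 8, "real_memory_mb": 31000, "gpus": 1},
    "gpu-[010-019]": {"cpus": 16, "real_memory_mb": 64000, "gpus": 2},
}
DEF_MEM_PER_GPU_MB = 32000
DEF_CPU_PER_GPU = 8


def default_request_problems(node, gpus_requested=1):
    """List the ways a default per-GPU request would exceed the node."""
    problems = []
    memory_needed = DEF_MEM_PER_GPU_MB * gpus_requested
    cpus_needed = DEF_CPU_PER_GPU * gpus_requested
    if memory_needed > node["real_memory_mb"]:
        problems.append(
            f"memory: needs {memory_needed} MB, node has {node['real_memory_mb']} MB"
        )
    if cpus_needed > node["cpus"]:
        problems.append(f"cpus: needs {cpus_needed}, node has {node['cpus']}")
    return problems


report = {name: default_request_problems(spec) for name, spec in NODES.items()}
for name, problems in report.items():
    print(name, problems or "fits")
```

With these numbers, a single-GPU request using the defaults asks for 32000 MB, which is more than the 31000 MB configured on gpu-[001-003] but fits within the 64000 MB on gpu-[010-019].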


unsubscribe
by LEROY Christine 208562 15 Oct '24

unsubscribe


How does --nodes=min[-max] determine number of nodes to allocate?
by Declan Valters 08 Oct '24


Hi,

I am trying to understand how Slurm determines the actual number of nodes to allocate to a job specified with --nodes=min[-max].

I've been trying to follow the source code on GitHub. Could anyone point me to the right area(s), source files or functions to look at to understand the program logic behind this?

Many thanks in advance,
Declan


Jobs not getting scheduled, no priority calculation, but still in queue?
by Cutts, Tim 07 Oct '24


Something odd is going on on our cluster. A user has a lot of pending jobs in a job array (a few thousand).

$ squeue -u kmnx005 -r -t PD | head -5
      JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
3045324_875      core run_scp_  kmnx005 PD  0:00     1 (JobArrayTaskLimit)
3045324_876      core run_scp_  kmnx005 PD  0:00     1 (JobArrayTaskLimit)
3045324_877      core run_scp_  kmnx005 PD  0:00     1 (JobArrayTaskLimit)
3045324_878      core run_scp_  kmnx005 PD  0:00     1 (JobArrayTaskLimit)

None are getting scheduled. But when I ask SLURM what that job's priority is, it produces no output:

$ sprio -j 3045324
JOBID PARTITION PRIORITY SITE AGE FAIRSHARE JOBSIZE PARTITION QOS TRES

Any clues what's going on here?

--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca

