Hi all,
I recently wrote a Slurm input plugin [0] for Telegraf [1].
I just wanted to let the community know in case you find it useful.
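If anyone wants to give it a quick try, a one-shot test of just this input should show whether metrics come through (the config path below is only a placeholder; the plugin's own options are documented in its README [0]):

$ telegraf --config /etc/telegraf/telegraf.conf --input-filter slurm --test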
Maybe its existence can also be included in the documentation somewhere?
Anyway, thanks a ton for your time,
Pablo Collado Soto
References:
0: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/slurm
1: https://www.influxdata.com/time-series-platform/telegraf/
+ -------------------------------------- +
| Never let your sense of morals prevent |
| you from doing what is right.          |
|        -- Salvor Hardin, "Foundation"  |
+ -------------------------------------- +
Hello,
We have a new cluster and I'm trying to set up fairshare accounting. I'm trying to track CPU, MEM and GPU. It seems that billing for individual jobs is correct, but billing isn't being accumulated (TRESRunMins is always 0).
In my slurm.conf, I think the relevant lines are
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTRES=gres/gpu
PriorityFlags=MAX_TRES
PartitionName=gpu Nodes=node[1-7] MaxCPUsPerNode=384 MaxTime=7-0:00:00 State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"
PartitionName=cpu Nodes=node[1-7] MaxCPUsPerNode=182 MaxTime=7-0:00:00 State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"
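(To spell out how I read these weights, e.g. for job 154 in the sacct output below, assuming PriorityFlags=MAX_TRES makes the billing the largest weighted TRES on the node:
  cpu:      2 x 1.0      = 2.0
  mem:      2G x 0.125/G = 0.25
  gres/gpu: 1 x 9.6      = 9.6
so billing = max(2.0, 0.25, 9.6) = 9.6, which sacct appears to report truncated to 9.)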
I currently have one recently finished job and one running job. sacct gives
$ sacct --format=JobID,JobName,ReqTRES%50,AllocTRES%50,TRESUsageInAve%50,TRESUsageInMax%50
JobID JobName ReqTRES AllocTRES TRESUsageInAve TRESUsageInMax
------------ ---------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- --------------------------------------------------
154 interacti+ billing=9,cpu=1,gres/gpu=1,mem=1G,node=1 billing=9,cpu=2,gres/gpu=1,mem=2G,node=1
154.interac+ interacti+ cpu=2,gres/gpu=1,mem=2G,node=1 cpu=00:00:00,energy=0,fs/disk=2480503,mem=3M,page+ cpu=00:00:00,energy=0,fs/disk=2480503,mem=3M,page+
155          interacti+    billing=9,cpu=1,gres/gpu=1,mem=1G,node=1    billing=9,cpu=2,gres/gpu=1,mem=2G,node=1
155.interac+ interacti+    cpu=2,gres/gpu=1,mem=2G,node=1
billing=9 seems correct to me, since I have 1 GPU allocated, which has the largest score of 9.6. However, sshare doesn't show anything in TRESRunMins
sshare --format=Account,User,RawShares,FairShare,RawUsage,EffectvUsage,TRESRunMins%110
Account User RawShares FairShare RawUsage EffectvUsage TRESRunMins
-------------------- ---------- ---------- ---------- ----------- ------------- --------------------------------------------------------------------------------------------------------------
root 21589714 1.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
abrol_group 2000 0 0.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
luchko_group                    2000              21589714     1.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
 luchko_group    tluchko          1   0.333333    21589714     1.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
Why is TRESRunMins all 0 for tluchko but RawUsage is not? I have checked and slurmdbd is running.
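For completeness, the commands I believe are the standard way to inspect this, listed in case I am checking the wrong thing:

$ sacctmgr show tres       # which TRES the database is tracking
$ scontrol show config | grep -E 'AccountingStorageTRES|PriorityDecayHalfLife|PriorityCalcPeriod'
$ sshare -l                # long output, which should include TRESRunMins as well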
Thank you,
Tyler
Sent with [Proton Mail](https://proton.me/) secure email.
Hi SchedMD,
I'm sure they will eventually, but do you know when the slides of the
SLUG'24 presentations will be available online at
https://slurm.schedmd.com/publications.html, like previous editions'?
Thanks!
--
Kilian
Hi
I'm using dynamic nodes with "slurmd -Z" with slurm 23.11.1.
Firstly, I find that when you do "scontrol show node" it shows the NodeAddr as an IP address rather than the NodeName. Because I'm playing around with running this in containers on Docker Swarm, I find this IP can be wrong. I can force it with scontrol update; however, after a while something updates it to something else again. Does anybody know if this is done by slurmd or slurmctld or something else?
How can I stop this from happening?
How can I get the node to register with the hostname rather than the IP?
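Is the right approach something along these lines, i.e. pinning the name at registration time and then correcting the address by hand (the node name below is a placeholder, and I am not sure the -N/--conf combination is the intended way)?

# inside the container
$ slurmd -Z -N dyn-node-01 --conf "Feature=dynamic"
# manual fix, which only sticks for a while
$ scontrol update NodeName=dyn-node-01 NodeAddr=dyn-node-01 NodeHostname=dyn-node-01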
cheers,
Jakub
Hello,
Issue 1:
I am using Slurm version 24.05.1. My slurmd has a single node where I
connect multiple GRES by enabling the oversubscribe feature.
I am able to use the advance reservation of a GRES only by using the GRES
name (tres=gres/gpu:SYSTEM12), i.e., during the reservation period, if
another user submits a job with SYSTEM12, then Slurm places the job in the
queue:
user1@host$ srun --gres=gpu:SYSTEM12:1 hostname
srun: job 333 queued and waiting for resources
But when another user submits a job without any system name, the job goes
through on that GRES immediately even though it is reserved:
user1@host$ srun --gres=gpu:1 hostname
mylinux.wbi.com
Also, I can see GresUsed as busy using "scontrol show node -d", which
means the job is running on the GRES/GPU and not just on CPUs.
In the same way, a job submission based on the Feature ("rev1" in my case)
also goes through even though it is reserved for other users in multiple
partitions.
Snippet of slurm.conf:
NodeName=cluster01 NodeAddr=cluster Port=6002 CPUs=8 Boards=1
SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 Feature="rev1"
Gres=gpu:SYSTEM12:1 RealMemory=64171 State=IDLE
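For reference, the reservation is created roughly along these lines, mirroring the tres= spec mentioned above (user, name and times are placeholders, and I am not certain this is the correct syntax for reserving a typed GRES):

$ scontrol create reservation ReservationName=resv1 Users=user2 Nodes=cluster01 \
      StartTime=now Duration=1-00:00:00 TRES=gres/gpu:SYSTEM12=1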
Issue 2:
During execution, srun prints some extra error messages in its output:
user1@host$ srun --gres=gpu:1 hostname
srun: error: extract_net_cred: net_cred not provided
srun: error: Malformed RPC of type RESPONSE_NODE_ALIAS_ADDRS(3017) received
srun: error: slurm_unpack_received_msg: [[inv1715771615.nxdi.us-aus01.nxp.com]:41242] Header lengths are longer than data received
mylinux.wbi.com
Regards,
MS
Dear slurm-user list,
I have a cloud node that is powered up and down on demand. Rarely it can
happen that Slurm's ResumeTimeout is reached and the node is therefore
powered down. We have set ReturnToService=2 in order to avoid the node
being marked down, because the instance behind that node is created on
demand, and therefore after a failure nothing stops the system from
starting the node again, as it is a different instance.
I thought this would be enough, but apparently the node is still marked
with "NOT_RESPONDING", which leads to Slurm not trying to schedule on it.
After a while NOT_RESPONDING is removed, but I would like to remove it
directly from within my fail script if possible, so that the node can
return to service immediately and not be blocked by "NOT_RESPONDING".
Best regards,
Xaver
OS: CentOS 8.5
Slurm: 22.05
Recently upgraded to 22.05. Upgrade was successful, but after a while I started to see the following messages in the slurmdbd.log file:
error: We have more time than is possible (9344745+7524000+0)(16868745) > 12362400 for cluster CLUSTERNAME(3434) from 2024-09-18T13:00:00 - 2024-09-18T14:00:00 tres 1 (this may happen if oversubscription of resources is allowed without Gang)
We do have partitions with overlapping nodes, but do not have "Suspend,Gang" set as the global PreemptMode; it is currently set to requeue.
I have also checked sacct and there are no runaway jobs listed.
Oversubscription is not enabled on any of the queues either.
Do I need to modify my Slurm config to address this, or is this an error condition caused by the upgrade?
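For reference, the runaway-job check I am referring to is the sacctmgr one, which I believe is the standard command:

$ sacctmgr show runawayjobs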
Thank you,
SS
Hello,
is it possible to change a pending job from --exclusive to
--exclusive=user? I tried scontrol update jobid=... oversubscribe=user,
but it seems to only accept yes or no.
Gerhard
Hello
We have another batch of new users and some more batches of large array jobs with very short runtimes, due to errors in the jobs or just by design. While trying to deal with these issues (setting ArrayTaskThrottle and user education), I had a thought: it would be very nice to have a limit on how many jobs can start in a given minute per user. If someone posted a 200000-task array job with 15-second tasks, the scheduler then wouldn't launch more than 100 or 200 per minute and would be less likely to bog down, but if the tasks had longer runtimes (1 hour +), it would only take a few extra minutes to start using all the resources they are allowed, without adding much overall delay to the whole set of jobs.
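For reference, the per-array throttle we set looks like this (script name and job id are placeholders):

# cap a 200000-task array at 200 simultaneously running tasks
$ sbatch --array=0-199999%200 job.sh
# or adjust an array that is already submitted
$ scontrol update JobId=12345 ArrayTaskThrottle=200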
I thought about adding something to our CLI filter, but usually these jobs ask for a runtime of 3-4 hours even though they run for <30 seconds, so the submit options don't indicate the problem jobs ahead of time.
We currently limit our users to 80% of the available resources, which is more than enough for fast-turnover jobs to bog Slurm down, but we have users who complain that they can't use the other 20% when the cluster is not busy, so putting in lower default restrictions is not currently an option.
Has this already been discussed and found not to be feasible for technical reasons? (I'm not finding anything like this yet searching the archives.)
I think Slurm used to have a feature-request severity on their bug submission site. Is there a severity level they prefer for suggested requests like this?
Thanks
Dear all SLUG attendees!
The information about which buildings/addresses the SLUG reception and
presentations are to be held in is not very visible on
https://slug24.splashthat.com. There is a map there with all locations
(https://www.google.com/maps/d/u/0/edit?mid=1bcGaTiW0TNB5noQsjQ3ulctzKuqlGrQ…),
but I've gotten questions about it, so:
The reception on Wednesday will be held on the top floor of Oslo Science Park
(Forskningsparken). Address: Gaustadalléen 21. There will be someone
in the reception who can point you in the right direction.
The presentations will be held in auditorium 3 in Helga Engs Hus ("Helga
Eng's House"). Address: Sem Sælands vei 7. Lunch will be in the
canteen in the same building.
The closest subway station to both these buildings is Blindern Subway
Station (Blindern T-banestasjon).
Looking forward to seeing you there!
--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo