- slurm-users - lists.schedmd.com

Slinky version 1.0.0 is now available
by Marrlow Warnicke 20 Nov '25

20 Nov '25

We are pleased to announce the availability of Slinky version 1.0.0! Slinky is SchedMD's set of components to integrate Slurm in Kubernetes environments. Slinky consists of two main projects, slurm-operator and slurm-bridge. Our landing page is here: https://www.slinky.ai The slurm-operator handles cases where users wish to run Slurm jobs within the Kubernetes cluster. Release v1.0.0: https://github.com/SlinkyProject/slurm-operator/tree/release-1.0 New features include hybrid support, integration points for external tooling, and workload protection and isolation. The full changelog may be found here: https://github.com/SlinkyProject/slurm-operator/blob/main/CHANGELOG/CHANGEL… The slurm-bridge handles the cases where you want Slurm scheduling on your cluster and to be able to run either Kubernetes or Slurm jobs. Release v1.0.0: https://github.com/SlinkyProject/slurm-bridge/tree/release-1.0 New features include support for DRA Extended Resources, support for TaintToleration and VolumeBinding plugins, and integration of new Slurm 25.11 support for granular node resource allocation assignments with GRES. The full changelog may be found here: https://github.com/SlinkyProject/slurm-bridge/blob/main/CHANGELOG/CHANGELOG… The SlinkyProject registry now has containers that support both amd64 (x86_64) and arm64 (aarch64) architectures. You may find these here: https://github.com/orgs/SlinkyProject/packages Apologies for the redundant email-we needed to clarify the release version. -- Marlow Warnicke Principal Cloud Engineer Commercial Slurm - and Slinky - Development and Support

1 0

Slinky version 1.0.0 is now available
by Marrlow Warnicke 20 Nov '25

20 Nov '25

We are pleased to announce the availability of Slinky version 1.0.0-rc1. Slinky is SchedMD’s set of components to integrate Slurm in Kubernetes environments. Slinky consists of two main projects, slurm-operator and slurm-bridge. Our landing page is here: https://www.slinky.ai The slurm-operator handles cases where users wish to run Slurm jobs within the Kubernetes cluster. Release v1.0.0-rc1: https://github.com/SlinkyProject/slurm-operator/tree/release-1.0 New features include hybrid support, integration points for external tooling, and workload protection and isolation. The full changelog may be found here: https://github.com/SlinkyProject/slurm-operator/blob/main/CHANGELOG/CHANGEL… The slurm-bridge handles the cases where you want Slurm scheduling on your cluster and to be able to run either Kubernetes or Slurm jobs. Release v1.0.0-rc1: https://github.com/SlinkyProject/slurm-bridge/tree/release-1.0 New features include support for DRA Extended Resources, support for TaintToleration and VolumeBinding plugins, and integration of new Slurm 25.11 support for granular node resource allocation assignments with GRES. The full changelog may be found here: https://github.com/SlinkyProject/slurm-bridge/blob/main/CHANGELOG/CHANGELOG… The SlinkyProject registry now has containers that support both amd64 (x86_64) and arm64 (aarch64) architectures. You may find these here: https://github.com/orgs/SlinkyProject/packages -- Marlow Warnicke Principal Cloud Engineer, SchedMD LLC Commercial Slurm - and Slinky - Development and Support

1 0

Recommended Stable Slurm Version for >100P Scale Clusters
by KK 18 Nov '25

18 Nov '25

We are currently planning to deploy a new HPC system with a total compute capacity exceeding 100 PF. As part of our preparation, we would like to understand which Slurm versions are considered stable and widely used at this scale. Could you please share your recommendations or experience regarding: 1. Which Slurm version is currently running reliably on very large-scale clusters (>100 PF or >10k nodes)? 2. Whether there are any versions we should avoid due to known issues at large scale. 3. Any best practices or configuration considerations for Slurm deployments of this size.

3 2

Invalid generic resource (gres) specification after RMA
by Lee 17 Nov '25

17 Nov '25

Good afternoon, I have a cluster that is managed by Base Command Manager (v10) and it has several Nvidia DGXs. dgx09 is a problem child. The entire node was RMA'd and now it no longer behaves the same as my other DGXs. I think the below symptoms are caused by a single underlying issue. *Symptoms : * 1. When I look at our 8 non-MIG DGXs, via `scontrol show node=dgxXY | grep Gres`, 7/8 DGXs report "Gres=gpu:*H100*:8(S:0-1)" while dgx09 reports " Gres=gpu:*h100*:8(S:0-1)" 2. When I submit a job to this node, I get : $ srun --reservation=g09_test --gres=gpu:2 --pty bash srun: error: Unable to create step for job 105035: Invalid generic resource (gres) specification ### No job is running on the node, yet AllocTRES shows consumed resources... $ scontrol show node=dgx09 | grep -i AllocTRES *AllocTRES=gres/gpu=2* ### dgx09 : /var/log/slurmd contains no information ### slurmctld shows : root@h01:# grep 105035 /var/log/slurmctld [2025-11-13T07:44:56.380] sched: _slurm_rpc_allocate_resources JobId=105035 NodeList=dgx09 usec=3420 [2025-11-13T07:44:56.482] _job_complete: JobId=105035 WTERMSIG 1 [2025-11-13T07:44:56.483] _job_complete: JobId=105035 done *Configuration : * 1. gres.conf : # This section of this file was automatically generated by cmd. Do not edit manually! # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE AutoDetect=NVML NodeName=dgx[01,02] Name=gpu Type=1g.20gb Count=32 AutoDetect=NVML NodeName=dgx[03-10] Name=gpu Type=h100 Count=8 AutoDetect=NVML # END AUTOGENERATED SECTION -- DO NOT REMOVE 2. grep NodeName slurm.conf root@h01:# grep NodeName slurm.conf NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32 Feature=location=local NodeName=dgx03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local 3. What slurmd detects on dgx09 root@dgx09:~# slurmd -C NodeName=dgx09 CPUs=224 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 RealMemory=2063937 UpTime=8-00:39:10 root@dgx09:~# slurmd -G slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected slurmd: Gres Name=gpu Type=h100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-55 CoreCnt=224 Links=-1,0,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML slurmd: Gres Name=gpu Type=h100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-55 CoreCnt=224 Links=0,-1,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML slurmd: Gres Name=gpu Type=h100 Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-55 CoreCnt=224 Links=0,0,-1,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML slurmd: Gres Name=gpu Type=h100 Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-55 CoreCnt=224 Links=0,0,0,-1,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML slurmd: Gres Name=gpu Type=h100 Count=1 Index=4 ID=7696487 File=/dev/nvidia4 Cores=56-111 CoreCnt=224 Links=0,0,0,0,-1,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML slurmd: Gres Name=gpu Type=h100 Count=1 Index=5 ID=7696487 File=/dev/nvidia5 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML slurmd: Gres Name=gpu Type=h100 Count=1 Index=6 ID=7696487 File=/dev/nvidia6 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML slurmd: Gres Name=gpu Type=h100 Count=1 Index=7 ID=7696487 File=/dev/nvidia7 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML *Questions : * 1. As far as I can tell, dgx09 is identical to all my non-MIG DGX nodes in terms of configuration and hardware. Why does scontrol report it having 'h100' with a lower case 'h' unlike the other dgxs which report with an upper case 'H'? 2. Why is dgx09 not accepting GPU jobs and afterwards it artificially thinks that there are GPUs allocated even though no jobs are on the node? 3. Are there additional tests / configurations that I can do to probe the differences between dgx09 and all my other nodes? Best regards, Lee

3 3

Slurm versions 25.05.5 and 24.11.7 are now available
by Marshall Garey 11 Nov '25

11 Nov '25

We are pleased to announce the availability of Slurm versions 25.05.5 and 24.11.7. These releases fix some mild to moderate severity issues and potential crashes. The full list of changes are available in the CHANGELOG files for each version: https://github.com/SchedMD/slurm/blob/slurm-25.05/CHANGELOG/slurm-25.05.md https://github.com/SchedMD/slurm/blob/slurm-24.11/CHANGELOG/slurm-24.11.md Slurm can be downloaded from: https://www.schedmd.com/download-slurm/ -- Marshall Garey Release Management, Support, and Development SchedMD LLC - Commercial Slurm Development and Support

1 0

Split node into 2 partitions
by Ratnasamy, Fritz 11 Nov '25

11 Nov '25

Hi, Is there a straightforward way to split a gpu node with 8 GPUs into 2 partitions with 4 GPUs on each? How do we edit the slurm.conf and gres.conf in this case? Thanks, *Fritz Ratnasamy*Data Scientist Information Technology

3 2

Slurm Community Survey 2025
by Tim Wickberg 10 Nov '25

10 Nov '25

With the Slurm 25.11 release out, and the 1.0.0rc1 for the Slinky Project done, we're quickly shifting into conference season. SchedMD staff are presenting at KubeCon North America on Slinky [1] this week. We don't have a booth here, but feel free to say hi if you see any of us, or send me a message on the Kubernetes or CNCF Slack channels (@wickberg) if you have something you'd like to discuss in person. Next week we'll be manning the Slurm Booth at SC25 [2], as well as hosting the annual Slurm Community Birds-of-a-Feather session [3] on Thursday from 12:15-1:15pm. For this year we wanted to to make the Slurm Community BoF survey available ahead of time, and to open it to a wider audience. This'll let us prepare some initial results ahead of the BoF (while still trying to update live data during the BoF), and the results from this are invaluable as we plan for future Slurm releases. The survey is available now: https://schedmd.com/survey For those not at SC25, we'll have a brief set of highlights from this survey included in the Slurm 25.11 release overview video on our YouTube channel in December. - Tim [1] https://kccncna2025.sched.com/event/27FW5/ [2] The Slurm Booth is #1641. We have a new halo banner this year as well. [3] https://sc25.conference-program.com/presentation/?id=bof101&sess=sess471 [4] https://www.youtube.com/SchedMDSlurm -- Tim Wickberg Chief Technology Officer, SchedMD LLC Commercial Slurm Development and Support

1 0

sreport and AccountUtilizationByUser
by Gestió Servidors 08 Nov '25

08 Nov '25

Hello, I woud like to get a full report that includes CPU usage of an account (detalied by user). I have run "sreport cluster AccountUtilizationByUser start=9/1 end=11/2 account=mygr" and the output is this: -------------------------------------------------------------------------------- Cluster/Account/User Utilization 2025-09-01T00:00:00 - 2025-11-01T23:59:59 (5360400 secs) Usage reported in CPU Minutes -------------------------------------------------------------------------------- Cluster Account Login Proper Name Used Energy --------- --------------- --------- --------------- --------- -------- mycluster mygr 3979368 0 mycluster mygr mygr-1 3979368 0 mycluster mygr mygr-10 3979368 0 mycluster mygr mygr-11 3979368 0 mycluster mygr mygr-12 3979368 0 mycluster mygr mygr-13 3979368 0 mycluster mygr mygr-14 3979368 0 mycluster mygr mygr-15 3979368 0 mycluster mygr mygr-16 3979368 0 mycluster mygr mygr-17 3979368 0 mycluster mygr mygr-18 3979368 0 mycluster mygr mygr-19 3979368 0 mycluster mygr mygr-2 3979368 0 mycluster mygr mygr-20 3979368 0 mycluster mygr mygr-21 3979368 0 mycluster mygr mygr-22 3979368 0 mycluster mygr mygr-23 3979368 0 mycluster mygr mygr-24 3979368 0 mycluster mygr mygr-25 3979368 0 mycluster mygr mygr-26 3979368 0 mycluster mygr mygr-27 3979368 0 mycluster mygr mygr-28 3979368 0 mycluster mygr mygr-29 3979368 0 mycluster mygr mygr-3 3979368 0 mycluster mygr mygr-30 3979368 0 mycluster mygr mygr-31 3979368 0 mycluster mygr mygr-32 3979368 0 mycluster mygr mygr-33 3979368 0 mycluster mygr mygr-34 3979368 0 mycluster mygr mygr-35 3979368 0 mycluster mygr mygr-36 3979368 0 mycluster mygr mygr-37 3979368 0 mycluster mygr mygr-38 3979368 0 mycluster mygr mygr-39 3979368 0 mycluster mygr mygr-4 3979368 0 mycluster mygr mygr-40 3979368 0 mycluster mygr mygr-5 3979368 0 mycluster mygr mygr-6 3979368 0 mycluster mygr mygr-7 3979368 0 mycluster mygr mygr-8 3979368 0 mycluster mygr mygr-9 3979368 0 I don't understand why column "Used" shows always the same value, because each user "mygr-XX" has used the cluster in differents ways, times and each used has submitted very different number of jobs, as you can see in this "sshare" output: [root@login bin]# sshare -l -A mygr Account User RawShares NormShares RawUsage NormUsage EffectvUsage FairShare LevelFS -------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ---------- mygr 1 0.250000 3941705 0.337402 0.961033 0.260137 mygr mygr-1 1 0.024390 32099 0.002748 0.008144 0.095146 2.995008 mygr mygr-10 1 0.024390 98343 0.008418 0.024949 0.067961 0.977586 mygr mygr-11 1 0.024390 213838 0.018304 0.054250 0.052427 0.449588 mygr mygr-12 1 0.024390 134149 0.011483 0.034033 0.056311 0.716657 mygr mygr-13 1 0.024390 89304 0.007644 0.022656 0.071845 1.076537 mygr mygr-14 1 0.024390 524836 0.044925 0.133150 0.040777 0.183179 mygr mygr-15 1 0.024390 303494 0.025979 0.076996 0.044660 0.316774 mygr mygr-16 1 0.024390 99020 0.008476 0.025121 0.066019 0.970897 mygr mygr-17 1 0.024390 296341 0.025366 0.075181 0.046602 0.324421 mygr mygr-18 1 0.024390 37809 0.003236 0.009592 0.085437 2.542696 mygr mygr-19 1 0.024390 72158 0.006177 0.018306 0.077670 1.332337 mygr mygr-2 1 0.024390 225658 0.019316 0.057249 0.050485 0.426039 mygr mygr-20 1 0.024390 0 0.000000 0.000000 0.118447 inf mygr mygr-21 1 0.024390 0 0.000000 0.000000 0.118447 inf mygr mygr-22 1 0.024390 115042 0.009847 0.029186 0.062136 0.835683 mygr mygr-23 1 0.024390 54169 0.004637 0.013743 0.081553 1.774775 mygr mygr-24 1 0.024390 125767 0.010765 0.031907 0.058252 0.764422 mygr mygr-25 1 0.024390 243088 0.020808 0.061671 0.048544 0.395490 mygr mygr-26 1 0.024390 35516 0.003040 0.009010 0.089320 2.706902 mygr mygr-27 1 0.024390 65281 0.005588 0.016562 0.079612 1.472682 mygr mygr-28 1 0.024390 118203 0.010118 0.029988 0.060194 0.813335 mygr mygr-29 1 0.024390 0 0.000000 0.000000 0.118447 inf mygr mygr-3 1 0.024390 107543 0.009206 0.027284 0.064078 0.893956 mygr mygr-30 1 0.024390 33946 0.002906 0.008612 0.091262 2.832099 mygr mygr-31 1 0.024390 183059 0.015670 0.046442 0.054369 0.525180 mygr mygr-32 1 0.024390 0 0.000000 0.000000 0.118447 inf mygr mygr-33 1 0.024390 0 0.000000 0.000000 0.118447 inf mygr mygr-34 1 0.024390 0 0.000000 0.000000 0.118447 inf mygr mygr-35 1 0.024390 0 0.000000 0.000000 0.118447 inf mygr mygr-36 1 0.024390 0 0.000000 0.000000 0.118447 inf mygr mygr-37 1 0.024390 0 0.000000 0.000000 0.118447 inf mygr mygr-38 1 0.024390 0 0.000000 0.000000 0.118447 inf mygr mygr-39 1 0.024390 0 0.000000 0.000000 0.118447 inf mygr mygr-4 1 0.024390 37605 0.003219 0.009541 0.087379 2.556486 mygr mygr-40 1 0.024390 0 0.000000 0.000000 0.118447 inf mygr mygr-5 1 0.024390 81114 0.006943 0.020578 0.073786 1.185232 mygr mygr-6 1 0.024390 91873 0.007864 0.023308 0.069903 1.046428 mygr mygr-7 1 0.024390 80513 0.006892 0.020426 0.075728 1.194082 mygr mygr-8 1 0.024390 367822 0.031485 0.093316 0.042718 0.261374 mygr mygr-9 1 0.024390 33630 0.002879 0.008532 0.093204 2.858713 mygr mygr-tutor 1 0.024390 40472 0.003464 0.010268 0.083495 2.375427 What am I doing wrong? What I need to get is a detalied report from all "mygr" users (mygr account) from CPU usage and, if it is possible, other values. Thanks.

2 1

Slinky version 1.0.0rc1 is now now available
by Marrlow Warnicke 07 Nov '25

07 Nov '25

We are pleased to announce the availability of Slinky version 1.0.0-rc1. Slinky is SchedMD’s set of components to integrate Slurm in Kubernetes environments. Slinky consists of two main projects, slurm-operator and slurm-bridge. Our landing page is here: https://www.slinky.ai The slurm-operator handles cases where users wish to run Slurm jobs within the Kubernetes cluster. Release v1.0.0-rc1: https://github.com/SlinkyProject/slurm-operator/tree/release-1.0 New features include hybrid support, integration points for external tooling, and workload protection and isolation. The full changelog may be found here: https://github.com/SlinkyProject/slurm-operator/blob/main/CHANGELOG/CHANGEL… The slurm-bridge handles the cases where you want Slurm scheduling on your cluster and to be able to run either Kubernetes or Slurm jobs. Release v1.0.0-rc1: https://github.com/SlinkyProject/slurm-bridge/tree/release-1.0 New features include support for DRA Extended Resources, support for TaintToleration and VolumeBinding plugins, and integration of new Slurm 25.11 support for granular node resource allocation assignments with GRES. The full changelog may be found here: https://github.com/SlinkyProject/slurm-bridge/blob/main/CHANGELOG/CHANGELOG… The SlinkyProject registry now has containers that support both amd64 (x86_64) and arm64 (aarch64) architectures. You may find these here: https://github.com/orgs/SlinkyProject/packages -- Marlow Warnicke Principal Cloud Engineer, SchedMD LLC Commercial Slurm - and Slinky - Development and Support

1 0

submit host DNS lookup errors stall slurmctld, fail nodes, kill jobs
by moorehfl＠amazon.com 06 Nov '25

06 Nov '25

As we've scaled up our slurm usage, we've noticed that short, moderate bursts of DNS lookup failures are enough to regularly stall slurmctld: _xgetaddrinfo: getaddrinfo(runner-t35knco5d-project-54-concurrent-0:37251) failed ...this has a cascading effect where, when stalled, the controlled can't always communicate with nodes: error: Error connecting, bad data: family = 0, port = 0 ...and immediately the controller will mark the nodes as unhealthy, and kill jobs: slurmctld: Killing JobId=3120751 on failed node slurm-0f6cacdc1 The reason for the DNS failures is not unreliable DNS server or network, but rather the jobs are submitted by containers that don't have resolvable hostnames. This traditionally hasn't disrupted functionality, but we've noticed if 8-10 jobs all terminate at the same time (submitter container SIGTERMs the srun process) that the controller can be easily overloaded for several seconds, despite having significant free system resources. gdb confirms the process is hanging on DNS. We also can see "Socket timed out on send/recv operation" from clients attempting to interact with the controller during the issue. slurm 24.11.0 RHEL 8.10 kernel 4.18.0-553.58.1.el8_10.x86_64 We're looking into ways to get our ephemeral job submitter containers resolvable in DNS to prevent lookup failures (either by giving them resolvable hostnames, or by blackholing the records to 0.0.0.0 to allow for fast local failure on the slurmctld server). However, it does seem unusual for a handful of bad DNS lookups to cause so much disruption in slurmctld. Is this a known weak point of ctld? The slurmctld host is a single-purpose 16 vCPU 30GB EC2 instance with minimal load. We have ~150 nodes, all nodes have valid IPs in slurm.conf to remove the need for ctld to perform lookups for nodes, but apparently there is still a need to lookup the submit host as well, and we can reliably reproduce these cascading failures. Another possiblity might be to extend SlurmdTimeout to something very long and hope that the controller recovers from its stall in enough time to prevent from marking nodes as unhealthy and killing jobs, but it's not clear if that will have any effect since the first occurrence of "error: Error connecting, bad data: family = 0, port = 0" immediately drains nodes and kills jobs. Thanks

1 0

2025

2024