We are currently planning to deploy a new HPC system with a total compute capacity exceeding 100 PF. As part of our preparation, we would like to understand which Slurm versions are considered stable and widely used at this scale.
Could you please share your recommendations or experience regarding:
1. Which Slurm version is currently running reliably on very large-scale clusters (>100 PF or >10k nodes)?
2. Whether there are any versions we should avoid due to known issues at large scale.
3. Any best practices or configuration considerations for Slurm deployments of this size.
I would take a step back and ask how you intend to install and manage this cluster.
CPU-only or GPUs? OS? Interconnect fabric? Storage?
Power per rack? Cooling? Monitoring?
In reply to the message sent by KK via slurm-users <slurm-users@lists.schedmd.com> on Sun, Nov 16, 2025, 2:39 PM.