We'll have more details as conference season approaches this November,
but SchedMD staff are presenting at KubeCon NA on Slinky
[1]. We'll be manning the Slurm Booth at SC25 [2], as well as hosting
the annual Slurm Community Birds-of-a-Feather session [3].
I'll also send out a link to the survey questions for the BoF to the
slurm-users list ahead of the conference, and we'll go into more depth
on the answers during the BoF this year.
The events page on the SchedMD website has more detail on future events
as well: https://www.schedmd.com/events/
A few folks had asked, and apparently we never mentioned this more
publicly, but: SchedMD does not plan to hold an in-person SLUG in 2025
or 2026. We are working to bring some of the same content to our YouTube
channel [4] as a way to disseminate it more broadly, starting with the
Slurm 25.11 release overview in December.
- Tim
[1] https://kccncna2025.sched.com/event/27FW5/
[2] The Slurm Booth is #1641.
[3] https://sc25.conference-program.com/presentation/?id=bof101&sess=sess471
[4] https://www.youtube.com/SchedMDSlurm
--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support
We are pleased to announce the availability of Slurm version 25.05.4.
This version increases the default number of maximum connections to
slurmctld from 50 to 512, fixes a regression added in 25.05.2 that broke
compatibility with PMIx v2.x through v3.1.0rc1, and fixes other minor to
moderate bugs.
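For sites that want to tune this limit themselves, the cap is exposed as the conmgr_max_connections parameter. A minimal slurm.conf sketch (the value of 1024 is an illustrative assumption, not a recommendation; check the slurm.conf man page for your version before using it):

```
# slurm.conf - raise the slurmctld connection cap above the new 512 default
SlurmctldParameters=conmgr_max_connections=1024
```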
The full list of changes is available in the CHANGELOG:
https://github.com/SchedMD/slurm/blob/slurm-25.05/CHANGELOG/slurm-25.05.md
Slurm can be downloaded from:
https://www.schedmd.com/download-slurm/
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
We are pleased to announce the availability of Slurm version 25.05.3.
This version fixes an issue that prevented deleting a QOS when running
with MySQL servers (MariaDB was unaffected). Please note that the
slurmdbd will require MySQL 8.0.4+ or MariaDB 10.0.5+ to function
correctly. This version also fixes heterogeneous jobs when TLS is
enabled, a logging issue with syslog, and various mild to moderate
stability issues.
The full list of changes is available in the CHANGELOG:
https://github.com/SchedMD/slurm/blob/slurm-25.05/CHANGELOG/slurm-25.05.md
Slurm can be downloaded from:
https://www.schedmd.com/download-slurm/
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
We are pleased to announce the availability of Slurm version 25.05.2.
This version fixes a few regressions with X11 forwarding in 25.05 that
may prevent applications from launching, adds support for PMIx v6.x,
fixes a variety of stability issues, fixes a regression where
--tres-per-task was ignored, and fixes additional minor to moderate
severity issues.
The full list of changes is available in the CHANGELOG:
https://github.com/SchedMD/slurm/blob/slurm-25.05/CHANGELOG/slurm-25.05.md
Slurm can be downloaded from:
https://www.schedmd.com/download-slurm/
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
We are pleased to announce the availability of Slurm versions 25.05.1
and 24.11.6.
Changes in 25.05 include the following:
* Fix many issues with the TLS Certificate Manager introduced in 25.05
* Optimize account deletion
* Fix a bug when reordering the association hierarchy
* Fix some issues that cause daemon crashes
* Fix a variety of memory leaks
Changes in 24.11 include the following:
* Fix some issues that cause daemons to crash
* Fix some race conditions on shutdown that cause daemons to crash or hang
The full list of changes is available in the CHANGELOG for each version:
https://github.com/SchedMD/slurm/blob/slurm-25.05/CHANGELOG/slurm-25.05.md
https://github.com/SchedMD/slurm/blob/slurm-24.11/CHANGELOG/slurm-24.11.md
Slurm can be downloaded from:
https://www.schedmd.com/download-slurm/
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
We are pleased to announce the availability of Slurm release candidate
25.05.0rc1.
To highlight some new features coming in 25.05:
- Support for defining multiple topology configurations, and varying
them by partition.
- Support for tracking and allocating hierarchical resources.
- Dynamic nodes can now be added to the topology.
- topology/block - Allow for gaps in the block layout.
- Support for encrypting all network communication with TLS.
- jobcomp/kafka - Optionally send job info at job start as well as job end.
- Support an OR operator in --license requests.
- switch/hpe_slingshot - Support for > 252 ranks per node.
- switch/hpe_slingshot - Support mTLS authentication to the fabric manager.
- sacctmgr - Add support for dumping and loading QOSes.
- srun - Add new --wait-for-children option to keep the step running
until all launched processes have completed (cgroup/v2 only).
- slurmrestd - Add new endpoint for creating reservations.
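A couple of these features are visible directly on the command line. The sketch below is a hedged illustration (the license names and scripts are hypothetical, and exact syntax should be confirmed against the 25.05 srun and sbatch man pages):

```
# Request either of two (hypothetical) licenses with the new OR operator
sbatch --licenses="lic_a|lic_b" job.sh

# Keep the step running until all launched processes have completed
# (cgroup/v2 only)
srun --wait-for-children ./launcher.sh
```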
This is the first release candidate of the upcoming 25.05 release
series. It represents the end of development for this release and the
finalization of the RPC and state file formats.
If any issues are identified with this release candidate, please report
them through https://bugs.schedmd.com against the 25.05.x version and we
will address them before the first production 25.05.0 release is made.
Please note that the release candidates are not intended for production use.
A preview of the updated documentation can be found at
https://slurm.schedmd.com/archive/slurm-master/ .
Slurm can be downloaded from https://www.schedmd.com/download-slurm/.
The changelog for 25.05.0rc1 can be found here:
https://github.com/SchedMD/slurm/blob/master/CHANGELOG/slurm-25.05.md#chang…
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
Slurm versions 24.11.5, 24.05.8, and 23.11.11 are now available and
include a fix for a recently discovered security issue.
SchedMD customers were informed on April 23rd and provided a patch on
request; this process is documented in our security policy. [1]
A mistake with permission handling for Coordinators within Slurm's
accounting system can allow a Coordinator to promote a user to
Administrator. (CVE-2025-43904)
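Upgrading to a patched version is the actual fix, but affected sites may also want to audit for unexpected privilege escalations. A hedged sketch using standard sacctmgr output (field names per the sacctmgr man page):

```
# List all users and their admin level; any account unexpectedly set to
# "Administrator" merits investigation
sacctmgr show user format=User,AdminLevel
```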
Thank you to Sekou Diakite (HPE) for reporting this.
Downloads are available at https://www.schedmd.com/downloads.php .
Release notes follow below.
- Tim
[1] https://www.schedmd.com/security-policy/
--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support
> * Changes in Slurm 24.11.5
> ==========================
> -- Return error to scontrol reboot on bad nodelists.
> -- slurmrestd - Report an error when QOS resolution fails for v0.0.40
> endpoints.
> -- slurmrestd - Report an error when QOS resolution fails for v0.0.41
> endpoints.
> -- slurmrestd - Report an error when QOS resolution fails for v0.0.42
> endpoints.
> -- data_parser/v0.0.42 - Added +inline_enums flag which modifies the
> output when generating OpenAPI specification. It causes enum arrays to not
> be defined in their own schema with references ($ref) to them. Instead they
> will be dumped inline.
> -- Fix binding error with tres-bind map/mask on partial node allocations.
> -- Fix stepmgr enabled steps being able to request features.
> -- Reject step creation if requested feature is not available in job.
> -- slurmd - Restrict listening for new incoming RPC requests further into
> startup.
> -- slurmd - Avoid auth/slurm related hangs of CLI commands during startup
> and shutdown.
> -- slurmctld - Restrict processing new incoming RPC requests further into
> startup. Stop processing requests sooner during shutdown.
> -- slurmctld - Avoid auth/slurm related hangs of CLI commands during
> startup and shutdown.
> -- slurmctld - Avoid race condition during shutdown or reconfigure that
> could result in a crash due to delayed processing of a connection while
> plugins are unloaded.
> -- Fix small memleak when getting the job list from the database.
> -- Fix incorrect printing of % escape characters when printing stdio
> fields for jobs.
> -- Fix padding parsing when printing stdio fields for jobs.
> -- Fix printing %A array job id when expanding patterns.
> -- Fix reservations causing jobs to be held for Bad Constraints.
> -- switch/hpe_slingshot - Prevent potential segfault on failed curl
> request to the fabric manager.
> -- Fix printing incorrect array job id when expanding stdio file names.
> The %A will now be substituted by the correct value.
> -- switch/hpe_slingshot - Fix vni range not updating on slurmctld restart
> or reconfigure.
> -- Fix steps not being created when using certain combinations of -c and
> -n lower than the job's requested resources, when using stepmgr and nodes
> are configured with CPUs == Sockets*CoresPerSocket.
> -- Permit configuring the number of retry attempts to destroy CXI service
> via the new destroy_retries SwitchParameter.
> -- Do not reset memory.high and memory.swap.max in slurmd startup or
> reconfigure as we are never really touching this in slurmd.
> -- Fix reconfigure failure of slurmd when it has been started manually and
> the CoreSpecLimits have been removed from slurm.conf.
> -- Set or reset CoreSpec limits when slurmd is reconfigured and it was
> started with systemd.
> -- switch/hpe_slingshot - Make sure the slurmctld can free step VNIs after
> the controller restarts or reconfigures while the job is running.
> -- Fix backup slurmctld failure on 2nd takeover.
> -- Testsuite - fix python test 130_2.
> -- Fix security issue where a coordinator could add a user with elevated
> privileges. CVE-2025-43904.
> * Changes in Slurm 24.05.8
> ==========================
> -- Testsuite - fix python test 130_2.
> -- Fix security issue where a coordinator could add a user with elevated
> privileges. CVE-2025-43904.
> * Changes in Slurm 23.11.11
> ===========================
> -- Fixed a job requeuing issue that merged job entries into the same SLUID
> when all nodes in a job failed simultaneously.
> -- Add ABORT_ON_FATAL environment variable to capture a backtrace from any
> fatal() message.
> -- Testsuite - fix python test 130_2.
> -- Fix security issue where a coordinator could add a user with elevated
> privileges. CVE-2025-43904.
We are pleased to announce the availability of Slurm version 24.11.4.
This release fixes a variety of major to minor severity bugs. Some edge
cases that caused jobs to pend forever are fixed. Notable stability
issues that are fixed include:
* slurmctld crashing upon receiving a certain heterogeneous job submission.
* slurmd crashing after a communications failure with a slurmstepd.
* A variety of race conditions related to receiving and processing
connections, including one that resulted in the slurmd ignoring new RPC
connections.
Downloads are available at https://www.schedmd.com/downloads.php .
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
> -- slurmctld,slurmrestd - Avoid possible race condition that could have caused
> process to crash when listener socket was closed while accepting a new
> connection.
> -- slurmrestd - Avoid race condition that could have resulted in address
> logged for a UNIX socket to be incorrect.
> -- slurmrestd - Fix parameters in OpenAPI specification for the following
> endpoints to have "job_id" field:
> GET /slurm/v0.0.40/jobs/state/
> GET /slurm/v0.0.41/jobs/state/
> GET /slurm/v0.0.42/jobs/state/
> GET /slurm/v0.0.43/jobs/state/
> -- slurmd - Fix tracking of thread counts that could cause incoming
> connections to be ignored after burst of simultaneous incoming connections
> that trigger delayed response logic.
> -- Stepmgr - Avoid unnecessary SRUN_TIMEOUT forwarding to stepmgr.
> -- Fix jobs being scheduled on higher weighted powered down nodes.
> -- Fix how backfill scheduler filters nodes from the available nodes based on
> exclusive user and mcs_label requirements.
> -- acct_gather_energy/{gpu,ipmi} - Fix potential energy consumption adjustment
> calculation underflow.
> -- acct_gather_energy/ipmi - Fix regression introduced in 24.05.5 (which
> introduced the new way of preserving energy measurements through slurmd
> restarts) when EnergyIPMICalcAdjustment=yes.
> -- Prevent slurmctld deadlock in the assoc mgr.
> -- Fix memory leak when RestrictedCoresPerGPU is enabled.
> -- Fix preemptor jobs not entering execution due to wrong calculation of
> accounting policy limits.
> -- Fix certain job requests that were incorrectly denied with node
> configuration unavailable error.
> -- slurmd - Avoid crash when slurmd has a communications failure with
> slurmstepd.
> -- Fix memory leak when parsing yaml input.
> -- Prevent slurmctld from showing error message about PreemptMode=GANG being a
> cluster-wide option for `scontrol update part` calls that don't attempt to
> modify partition PreemptMode.
> -- Fix setting GANG preemption on partition when updating PreemptMode with
> scontrol.
> -- Fix CoreSpec and MemSpec limits not being removed from previously
> configured slurmd.
> -- Avoid race condition that could lead to a deadlock when slurmd, slurmstepd,
> slurmctld, slurmrestd or sackd have a fatal event.
> -- Fix jobs using --ntasks-per-node and --mem pending forever when the
> requested memory divided by the number of CPUs surpasses the configured
> MaxMemPerCPU.
> -- slurmd - Fix address logged upon new incoming RPC connection from "INVALID"
> to IP address.
> -- Fix memory leak when retrieving reservations. This affects scontrol, sinfo,
> sview, and the following slurmrestd endpoints:
> 'GET /slurm/{any_data_parser}/reservation/{reservation_name}'
> 'GET /slurm/{any_data_parser}/reservations'
> -- Log warning instead of debugflags=conmgr gated log when deferring new
> incoming connections when number of active connections exceed
> conmgr_max_connections.
> -- Avoid race condition that could result in worker thread pool not activating
> all threads at once after a reconfigure resulting in lower utilization of
> available CPU threads until enough internal activity wakes up all threads
> in the worker pool.
> -- Avoid theoretical race condition that could result in new incoming RPC
> socket connections being ignored after reconfigure.
> -- slurmd - Avoid race condition that could result in a state where new
> incoming RPC connections will always be ignored.
> -- Add ReconfigFlags=KeepNodeStateFuture to restore saved FUTURE node state on
> restart and reconfig instead of reverting to FUTURE state. This will be
> made the default in 25.05.
> -- Fix case where hetjob submit would cause slurmctld to crash.
> -- Fix jobs using --cpus-per-gpu and --mem pending forever when the
> requested memory divided by the number of CPUs surpasses the configured
> MaxMemPerCPU.
> -- Enforce that jobs using --mem and several --*-per-* options do not violate
> the MaxMemPerCPU in place.
> -- slurmctld - Fix use-cases of jobs incorrectly pending held when --prefer
> features are not initially satisfied.
> -- slurmctld - Fix jobs incorrectly held when --prefer not satisfied in some
> use-cases.
> -- Ensure RestrictedCoresPerGPU and CoreSpecCount don't overlap.
We are pleased to announce the availability of Slurm version 24.05.7.
This release fixes some stability issues in 24.05, including a crash in
slurmctld after updating a reservation with an empty nodelist.
Downloads are available at https://www.schedmd.com/downloads.php .
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
> * Changes in Slurm 24.05.7
> ==========================
> -- Fix slurmctld crash after updating a reservation with an empty
> nodelist. The crash could occur after restarting slurmctld, or if
> downing/draining a node in the reservation with the REPLACE or REPLACE_DOWN
> flag.
> -- Fix jobs being scheduled on higher weighted powered down
> nodes.
> -- Fix memory leak when RestrictedCoresPerGPU is enabled.
> -- Prevent slurmctld deadlock in the assoc mgr.