We are pleased to announce the availability of Slurm versions 25.05.1
and 24.11.6.
Changes in 25.05 include the following:
* Fix many issues with the TLS Certificate Manager introduced in 25.05
* Optimize account deletion
* Fix a bug when reordering the association hierarchy
* Fix some issues that cause daemon crashes
* Fix a variety of memory leaks
Changes in 24.11 include the following:
* Fix some issues that cause daemons to crash
* Fix some race conditions on shutdown that cause daemons to crash or hang
The full list of changes is available in the CHANGELOG for each version:
https://github.com/SchedMD/slurm/blob/slurm-25.05/CHANGELOG/slurm-25.05.md
https://github.com/SchedMD/slurm/blob/slurm-24.11/CHANGELOG/slurm-24.11.md
Slurm can be downloaded from:
https://www.schedmd.com/download-slurm/
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
We are pleased to announce the availability of Slurm release candidate
25.05.0rc1.
To highlight some new features coming in 25.05:
- Support for defining multiple topology configurations, and varying
them by partition.
- Support for tracking and allocating hierarchical resources.
- Dynamic nodes can now be added to the topology.
- topology/block - Allow for gaps in the block layout.
- Support for encrypting all network communication with TLS.
- jobcomp/kafka - Optionally send job info at job start as well as job end.
- Support an OR operator in --license requests.
- switch/hpe_slingshot - Support for > 252 ranks per node.
- switch/hpe_slingshot - Support mTLS authentication to the fabric manager.
- sacctmgr - Add support for dumping and loading QOSes.
- srun - Add new --wait-for-children option to keep the step running
  until all launched processes have completed (cgroup/v2 only). See the
  example after this list.
- slurmrestd - Add new endpoint for creating reservations.
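As a quick sketch of the new srun option above (the script name is
hypothetical; cgroup/v2 is required):

    # Without --wait-for-children the step ends when the launched task
    # exits; with it, the step stays alive until every forked child
    # process has completed.
    srun --wait-for-children ./spawn_workers.sh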
This is the first release candidate of the upcoming 25.05 release
series. It represents the end of development for this release and the
finalization of the RPC and state file formats.
If any issues are identified with this release candidate, please report
them through https://bugs.schedmd.com against the 25.05.x version and we
will address them before the first production 25.05.0 release is made.
Please note that the release candidates are not intended for production use.
A preview of the updated documentation can be found at
https://slurm.schedmd.com/archive/slurm-master/ .
Slurm can be downloaded from https://www.schedmd.com/download-slurm/.
The changelog for 25.05.0rc1 can be found here:
https://github.com/SchedMD/slurm/blob/master/CHANGELOG/slurm-25.05.md#chang…
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
Slurm versions 24.11.5, 24.05.8, and 23.11.11 are now available and
include a fix for a recently discovered security issue.
SchedMD customers were informed on April 23rd and provided a patch on
request; this process is documented in our security policy. [1]
A mistake with permission handling for Coordinators within Slurm's
accounting system can allow a Coordinator to promote a user to
Administrator. (CVE-2025-43904)
Thank you to Sekou Diakite (HPE) for reporting this.
Downloads are available at https://www.schedmd.com/downloads.php .
Release notes follow below.
- Tim
[1] https://www.schedmd.com/security-policy/
--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support
> * Changes in Slurm 24.11.5
> ==========================
> -- Return error to scontrol reboot on bad nodelists.
> -- slurmrestd - Report an error when QOS resolution fails for v0.0.40
> endpoints.
> -- slurmrestd - Report an error when QOS resolution fails for v0.0.41
> endpoints.
> -- slurmrestd - Report an error when QOS resolution fails for v0.0.42
> endpoints.
> -- data_parser/v0.0.42 - Added +inline_enums flag which modifies the
> output when generating the OpenAPI specification. It causes enum arrays to
> not be defined in their own schema with references ($ref) to them. Instead
> they will be dumped inline. (See the example below.)
> -- Fix binding error with tres-bind map/mask on partial node allocations.
> -- Fix stepmgr enabled steps being able to request features.
> -- Reject step creation if requested feature is not available in job.
> -- slurmd - Restrict listening for new incoming RPC requests further into
> startup.
> -- slurmd - Avoid auth/slurm related hangs of CLI commands during startup
> and shutdown.
> -- slurmctld - Restrict processing new incoming RPC requests further into
> startup. Stop processing requests sooner during shutdown.
> -- slurmctld - Avoid auth/slurm related hangs of CLI commands during
> startup and shutdown.
> -- slurmctld - Avoid race condition during shutdown or reconfigure that
> could result in a crash due to delayed processing of a connection while
> plugins are unloaded.
> -- Fix small memleak when getting the job list from the database.
> -- Fix incorrect printing of % escape characters when printing stdio
> fields for jobs.
> -- Fix padding parsing when printing stdio fields for jobs.
> -- Fix printing %A array job id when expanding patterns.
> -- Fix reservations causing jobs to be held for Bad Constraints.
> -- switch/hpe_slingshot - Prevent potential segfault on failed curl
> request to the fabric manager.
> -- Fix printing incorrect array job id when expanding stdio file names.
> The %A will now be substituted by the correct value.
> -- switch/hpe_slingshot - Fix vni range not updating on slurmctld restart
> or reconfigure.
> -- Fix steps not being created when using certain combinations of -c and
> -n smaller than the job's requested resources, when using stepmgr and
> nodes are configured with CPUs == Sockets*CoresPerSocket.
> -- Permit configuring the number of retry attempts to destroy CXI service
> via the new destroy_retries SwitchParameter.
> -- Do not reset memory.high and memory.swap.max during slurmd startup or
> reconfigure, as slurmd never actually touches these values.
> -- Fix reconfigure failure of slurmd when it has been started manually and
> the CoreSpecLimits have been removed from slurm.conf.
> -- Set or reset CoreSpec limits when slurmd is reconfigured and it was
> started with systemd.
> -- switch/hpe_slingshot - Make sure the slurmctld can free step VNIs after
> the controller restarts or reconfigures while the job is running.
> -- Fix backup slurmctld failure on 2nd takeover.
> -- Testsuite - fix python test 130_2.
> -- Fix security issue where a coordinator could add a user with elevated
> privileges. CVE-2025-43904.
> * Changes in Slurm 24.05.8
> ==========================
> -- Testsuite - fix python test 130_2.
> -- Fix security issue where a coordinator could add a user with elevated
> privileges. CVE-2025-43904.
> * Changes in Slurm 23.11.11
> ===========================
> -- Fixed a job requeuing issue that merged job entries into the same SLUID
> when all nodes in a job failed simultaneously.
> -- Add ABORT_ON_FATAL environment variable to capture a backtrace from any
> fatal() message.
> -- Testsuite - fix python test 130_2.
> -- Fix security issue where a coordinator could add a user with elevated
> privileges. CVE-2025-43904.
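An aside on the +inline_enums flag noted in the 24.11.5 changes above:
data_parser flags are selected with the plugin+flag syntax used
elsewhere in these notes (e.g. v0.0.40+complex). A hedged sketch, where
the --generate-openapi-spec invocation is assumed to be available as in
recent slurmrestd releases:

    # Dump the OpenAPI spec with enum arrays inlined rather than
    # defined in their own $ref'd schemas.
    slurmrestd -d v0.0.42+inline_enums --generate-openapi-spec > openapi.json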
We are pleased to announce the availability of Slurm version 24.11.4.
This release fixes a variety of major to minor severity bugs. Some edge
cases that caused jobs to pend forever are fixed. Notable stability
issues that are fixed include:
* slurmctld crashing upon receiving a certain heterogeneous job submission.
* slurmd crashing after a communications failure with a slurmstepd.
* A variety of race conditions related to receiving and processing
connections, including one that resulted in the slurmd ignoring new RPC
connections.
Downloads are available at https://www.schedmd.com/downloads.php .
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
> -- slurmctld,slurmrestd - Avoid possible race condition that could have caused
> process to crash when listener socket was closed while accepting a new
> connection.
> -- slurmrestd - Avoid race condition that could have resulted in address
> logged for a UNIX socket to be incorrect.
> -- slurmrestd - Fix parameters in OpenAPI specification for the following
> endpoints to have "job_id" field:
> GET /slurm/v0.0.40/jobs/state/
> GET /slurm/v0.0.41/jobs/state/
> GET /slurm/v0.0.42/jobs/state/
> GET /slurm/v0.0.43/jobs/state/
> -- slurmd - Fix tracking of thread counts that could cause incoming
> connections to be ignored after burst of simultaneous incoming connections
> that trigger delayed response logic.
> -- Stepmgr - Avoid unnecessary SRUN_TIMEOUT forwarding to stepmgr.
> -- Fix jobs being scheduled on higher weighted powered down nodes.
> -- Fix how the backfill scheduler filters nodes from the available nodes
> based on exclusive user and mcs_label requirements.
> -- acct_gather_energy/{gpu,ipmi} - Fix potential energy consumption adjustment
> calculation underflow.
> -- acct_gather_energy/ipmi - Fix regression introduced in 24.05.5 (which
> introduced the new way of preserving energy measurements through slurmd
> restarts) when EnergyIPMICalcAdjustment=yes.
> -- Prevent slurmctld deadlock in the assoc mgr.
> -- Fix memory leak when RestrictedCoresPerGPU is enabled.
> -- Fix preemptor jobs not entering execution due to wrong calculation of
> accounting policy limits.
> -- Fix certain job requests that were incorrectly denied with node
> configuration unavailable error.
> -- slurmd - Avoid a crash when slurmd has a communications failure with
> slurmstepd.
> -- Fix memory leak when parsing yaml input.
> -- Prevent slurmctld from showing error message about PreemptMode=GANG being a
> cluster-wide option for `scontrol update part` calls that don't attempt to
> modify partition PreemptMode.
> -- Fix setting GANG preemption on partition when updating PreemptMode with
> scontrol.
> -- Fix CoreSpec and MemSpec limits not being removed from previously
> configured slurmd.
> -- Avoid race condition that could lead to a deadlock when slurmd, slurmstepd,
> slurmctld, slurmrestd or sackd have a fatal event.
> -- Fix jobs using --ntasks-per-node and --mem pending forever when the
> requested memory divided by the number of CPUs surpasses the configured
> MaxMemPerCPU.
> -- slurmd - Fix address logged upon new incoming RPC connection from "INVALID"
> to IP address.
> -- Fix memory leak when retrieving reservations. This affects scontrol, sinfo,
> sview, and the following slurmrestd endpoints:
> 'GET /slurm/{any_data_parser}/reservation/{reservation_name}'
> 'GET /slurm/{any_data_parser}/reservations'
> -- Log a warning instead of a DebugFlags=conmgr gated log message when
> deferring new incoming connections when the number of active connections
> exceeds conmgr_max_connections.
> -- Avoid race condition that could result in the worker thread pool not
> activating all threads at once after a reconfigure, resulting in lower
> utilization of available CPU threads until enough internal activity
> wakes up all threads in the worker pool.
> -- Avoid theoretical race condition that could result in new incoming RPC
> socket connections being ignored after reconfigure.
> -- slurmd - Avoid race condition that could result in a state where new
> incoming RPC connections will always be ignored.
> -- Add ReconfigFlags=KeepNodeStateFuture to restore saved FUTURE node state
> on restart and reconfig instead of reverting to FUTURE state. This will be
> made the default in 25.05. (See the sketch after this changelog.)
> -- Fix case where hetjob submit would cause slurmctld to crash.
> -- Fix jobs using --cpus-per-gpu and --mem pending forever when the
> requested memory divided by the number of CPUs surpasses the configured
> MaxMemPerCPU.
> -- Enforce that jobs using --mem and several --*-per-* options do not violate
> the MaxMemPerCPU in place.
> -- slurmctld - Fix use-cases of jobs incorrectly pending held when --prefer
> features are not initially satisfied.
> -- slurmctld - Fix jobs incorrectly held when --prefer not satisfied in some
> use-cases.
> -- Ensure RestrictedCoresPerGPU and CoreSpecCount don't overlap.
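To opt in to the KeepNodeStateFuture behavior above before it becomes
the default in 25.05, the change is a single slurm.conf line:

    # slurm.conf - keep saved FUTURE node state across restarts
    # and reconfigures instead of reverting to FUTURE.
    ReconfigFlags=KeepNodeStateFuture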
We are pleased to announce the availability of Slurm version 24.05.7.
This release fixes some stability issues in 24.05, including a crash in
slurmctld after updating a reservation with an empty nodelist.
Downloads are available at https://www.schedmd.com/downloads.php .
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
> * Changes in Slurm 24.05.7
> ==========================
> -- Fix slurmctld crash after updating a reservation with an empty
> nodelist. The crash could occur after restarting slurmctld, or if
> downing/draining a node in the reservation with the REPLACE or REPLACE_DOWN
> flag.
> -- Fix jobs being scheduled on higher weighted powered down
> nodes.
> -- Fix memory leak when RestrictedCoresPerGPU is enabled.
> -- Prevent slurmctld deadlock in the assoc mgr.
We are pleased to announce the availability of Slurm version 24.11.3.
24.11.3 fixes the database cluster ID generation not being random, a
regression in which slurmd -G gave no output, a long-standing crash in
slurmctld after updating a reservation with an empty nodelist, and some
other minor to moderate bugs.
Downloads are available at https://www.schedmd.com/downloads.php .
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
> * Changes in Slurm 24.11.3
> ==========================
> -- Fix race condition in slurmrestd that resulted in "Requested
> data_parser plugin does not support OpenAPI plugin" error being returned
> for valid endpoints.
> -- If multiple partitions are requested, set the SLURM_JOB_PARTITION
> output environment variable to the partition in which the job is running
> for salloc and srun in order to match the documentation and the behavior of
> sbatch.
> -- Fix regression where slurmd -G gives no output.
> -- Don't print misleading errors for stepmgr enabled steps.
> -- slurmrestd - Avoid connection to slurmdbd for the following
> endpoints:
> GET /slurm/v0.0.41/jobs
> GET /slurm/v0.0.41/job/{job_id}
> -- slurmrestd - Avoid connection to slurmdbd for the following
> endpoints:
> GET /slurm/v0.0.40/jobs
> GET /slurm/v0.0.40/job/{job_id}
> -- Significantly increase entropy of the clusterid portion of the
> SLUID by seeding the random number generator.
> -- Avoid changing process name to "watch" from original daemon name.
> This could potentially break some monitoring scripts.
> -- Avoid slurmctld being killed by SIGALRM due to race condition
> at startup.
> -- Fix slurmctld crash after updating a reservation with an empty
> nodelist. The crash could occur after restarting slurmctld, or if
> downing/draining a node in the reservation with the REPLACE or REPLACE_DOWN
> flag.
> -- Fix race between task/cgroup cpuset and jobacct_gather/cgroup.
> The first was removing the pid from the task_X cgroup directory, causing
> memory limits to not be applied.
> -- srun - Fixed wrongly constructed SLURM_CPU_BIND env variable
> that could get propagated to nested srun calls in certain MPI
> environments, causing launch failures.
> -- slurmrestd - Fix possible memory leak when parsing arrays with
> data_parser/v0.0.40.
> -- slurmrestd - Fix possible memory leak when parsing arrays with
> data_parser/v0.0.41.
> -- slurmrestd - Fix possible memory leak when parsing arrays with
> data_parser/v0.0.42.
We are pleased to announce the availability of Slurm versions 24.11.2
and 24.05.6.
24.11.2 fixes a variety of minor to major bugs. Fixed regressions
include loading non-default QOS on pending jobs from pre-24.11 state,
pending jobs displaying QOS=(null) when not explicitly requesting a QOS,
running jobs that requested multiple partitions potentially having an
incorrect partition when slurmctld is restarted, and burst_buffer.lua
failing if slurm.conf is in a non-standard location. This release also
fixes a few crashes in slurmctld: crashing when a job that can preempt
requests --test-only, crashing when the scheduler evaluates a job on
nodes with suspended jobs, and crashing due to a long-standing bug that
produced a job record without job_resrcs.
24.05.6 fixes sattach with auth/slurm, a slurmrestd crash when using
data_parser/v0.0.40, a slurmctld crash when using job suspension, a
performance regression for RPCs with large amounts of data, and some
other moderate severity bugs.
Downloads are available at https://www.schedmd.com/downloads.php .
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
> * Changes in Slurm 24.11.2
> ==========================
> -- Fix segfault when submitting --test-only jobs that can preempt.
> -- Fix regression introduced in 23.11 that prevented the following
> flags from being added to a reservation on an update:
> DAILY, HOURLY, WEEKLY, WEEKDAY, and WEEKEND.
> -- Fix crash and issues when evaluating a job's suitability to run on
> nodes that already have suspended jobs.
> -- Slurmctld will ensure that healthy nodes are not reported as
> UnavailableNodes in job reason codes.
> -- Fix handling of jobs submitted to a current reservation with
> flags OVERLAP,FLEX or OVERLAP,ANY_NODES when it overlaps nodes with a
> future maintenance reservation. When a job submission had a time limit that
> overlapped with the future maintenance reservation, it was rejected. Now
> the job is accepted but stays pending with the reason "ReqNodeNotAvail,
> Reserved for maintenance".
> -- pam_slurm_adopt - avoid errors when explicitly setting
> some arguments to the default value.
> -- Fix qos preemption with PreemptMode=SUSPEND.
> -- slurmdbd - When changing a user's name, update lineage
> at the same time.
> -- Fix regression in 24.11 in which burst_buffer.lua does not
> inherit the SLURM_CONF environment variable from slurmctld and fails to run
> if slurm.conf is in a non-standard location.
> -- Fix memory leak in slurmctld if select/linear and the
> PreemptParameters=reclaim_licenses options are both set in slurm.conf.
> Regression in 24.11.1.
> -- Fix running jobs that requested multiple partitions potentially being
> set to the wrong partition on restart.
> -- switch/hpe_slingshot - Fix compatibility with newer cxi
> drivers, specifically when specifying disable_rdzv_get.
> -- Add ABORT_ON_FATAL environment variable to capture a backtrace
> from any fatal() message. (See the example below.)
> -- Fix printing invalid address in rate limiting log statement.
> -- sched/backfill - Fix node state PLANNED not being cleared from
> fully allocated nodes during a backfill cycle.
> -- select/cons_tres - Fix future planning of jobs with bf_licenses.
> -- Prevent redundant "on_data returned rc: Rate limit exceeded,
> please retry momentarily" error message from being printed in
> slurmctld logs.
> -- Fix loading non-default QOS on pending jobs from pre-24.11 state.
> -- Fix pending jobs displaying QOS=(null) when not explicitly
> requesting a QOS.
> -- Fix segfault issue from a job record with no job_resrcs.
> -- Fix failing sacctmgr delete/modify/show account operations
> with where clauses.
> -- Fix regression in 24.11 in which Slurm daemons started catching and
> ignoring several signals (SIGTSTP, SIGTTIN and SIGUSR1) that they
> previously did not ignore. This also prevented slurmctld from shutting
> down after a SIGTSTP, because slurmscriptd caught the signal and stopped
> while slurmctld ignored it. Unify and fix these situations and restore
> the previous behavior for these signals.
> -- Document that SIGQUIT is no longer ignored by slurmctld,
> slurmdbd, and slurmd in 24.11. As of 24.11.0rc1, SIGQUIT is identical to
> SIGINT and SIGTERM for these daemons, but this change was not documented.
> -- Fix not considering nodes marked for reboot without ASAP
> in the scheduler.
> -- Remove the boot^ state on unexpected node reboot after
> return to service.
> -- Do not allow new jobs to start on a node which is being rebooted
> with the flag nextstate=resume.
> -- Prevent lower priority job running after cancelling an ASAP reboot.
> -- Fix srun jobs starting on nextstate=resume rebooting nodes.
>
> * Changes in Slurm 24.05.6
> ==========================
> -- data_parser/v0.0.40 - Prevent a segfault in the slurmrestd when
> dumping data with v0.0.40+complex data parser.
> -- Fix sattach when using auth/slurm.
> -- scrun - Add support for the '--all' argument to the kill subcommand.
> -- Fix performance regression while packing larger RPCs.
> -- Fix crash and issues when evaluating a job's suitability to run on
> nodes that already have suspended jobs.
> -- Fixed a job requeuing issue that merged job entries into the
> same SLUID when all nodes in a job failed simultaneously.
> -- switch/hpe_slingshot - Fix compatibility with newer cxi
> drivers, specifically when specifying disable_rdzv_get.
> -- Add ABORT_ON_FATAL environment variable to capture a backtrace
> from any fatal() message.
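A sketch of the ABORT_ON_FATAL variable added above (setting it to "1"
in the daemon's environment is an assumption; -D runs slurmctld in the
foreground):

    # Turn any fatal() into an abort so a backtrace/core can be captured.
    ABORT_ON_FATAL=1 slurmctld -D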
We are pleased to announce the availability of Slurm version 24.11.1.
This fixes a few possible crashes of the slurmctld and slurmrestd; a
regression in 24.11 which caused file transfers to a job with sbcast to
not join the job container namespace; MPI apps using Intel OPA, PSM2 and
OMPI 5.x when run through srun; and various minor to moderate bugs.
Downloads are available at https://www.schedmd.com/downloads.php .
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
> * Changes in Slurm 24.11.1
> ==========================
> -- With client commands MIN_MEMORY will show mem_per_tres if specified.
> -- Fix errno message about bad constraint.
> -- slurmctld - Fix crash and possible split brain issue if the
> backup controller handles an scontrol reconfigure while in control
> before the primary resumes operation.
> -- Fix stepmgr not getting dynamic node addrs from the controller.
> -- stepmgr - avoid "Unexpected missing socket" errors.
> -- Fix `scontrol show steps` with dynamic stepmgr.
> -- Deny jobs using the "R:" option of --signal if PreemptMode=OFF
> globally.
> -- Force jobs using the "R:" option of --signal to be preemptable
> by requeue or cancel only. If PreemptMode on the partition or QOS is off
> or suspend, the job will default to using PreemptMode=cancel.
> -- If --mem-per-cpu exceeds MaxMemPerCPU, the number of cpus per
> task will always be increased even if --cpus-per-task was specified. This
> is needed to ensure each task gets the expected amount of memory. (See
> the worked example after this changelog.)
> -- Fix compilation issue on OpenSUSE Leap 15.
> -- Fix jobs using more nodes than needed when not using -N.
> -- Fix issue with an allocation being given fewer resources
> than needed when using --gres-flags=enforce-binding.
> -- select/cons_tres - Fix errors with MaxCpusPerSocket partition
> limit. Used cpus/cores weren't counted properly, and free ones weren't
> limited to those available, when the socket was partially allocated or
> the job request went beyond this limit.
> -- Fix issue when jobs were preempted for licenses even if there
> were enough licenses available.
> -- Fix srun ntasks calculation inside an allocation when nodes are
> requested using a min-max range.
> -- Print correct number of digits for TmpDisk in sdiag.
> -- Fix a regression in 24.11 which caused file transfers to a job
> with sbcast to not join the job container namespace.
> -- data_parser/v0.0.40 - Prevent a segfault in the slurmrestd when
> dumping data with v0.0.40+complex data parser.
> -- Remove logic to force lowercase GRES names.
> -- data_parser/v0.0.42 - Prevent the association id from always
> being dumped as NULL when parsing in complex mode. Instead it will now
> dump the id. This affects the following endpoints:
> GET slurmdb/v0.0.42/association
> GET slurmdb/v0.0.42/associations
> GET slurmdb/v0.0.42/config
> -- Fixed a job requeuing issue that merged job entries into the
> same SLUID when all nodes in a job failed simultaneously.
> -- When a job completes, try to give idle nodes to reservations with
> the REPLACE flag before allowing them to be allocated to jobs.
> -- Avoid expensive lookup of all associations when dumping or
> parsing for v0.0.42 endpoints.
> -- Avoid expensive lookup of all associations when dumping or
> parsing for v0.0.41 endpoints.
> -- Avoid expensive lookup of all associations when dumping or
> parsing for v0.0.40 endpoints.
> -- Fix segfault when testing jobs against nodes with invalid gres.
> -- Fix performance regression while packing larger RPCs.
> -- Document the new mcs/label plugin.
> -- job_container/tmpfs - Fix Xauthority file being created
> outside the container when EntireStepInNS is enabled.
> -- job_container/tmpfs - Fix spank_task_post_fork not always
> running in the container when EntireStepInNS is enabled.
> -- Fix a job potentially getting stuck in CG on permissions
> errors while setting up X11 forwarding.
> -- Fix error on X11 shutdown if Xauthority file was not created.
> -- slurmctld - Fix memory or fd leak if an RPC is received that
> is not registered for processing.
> -- Inject OMPI_MCA_orte_precondition_transports when using PMIx. This fixes
> MPI apps using Intel OPA, PSM2 and OMPI 5.x when run through srun.
> -- Don't skip the first partition_job_depth jobs per partition.
> -- Fix gres allocation issue after controller restart.
> -- Fix issue where jobs requesting cpus-per-gpu hang in queue.
> -- switch/hpe_slingshot - Treat HTTP status forbidden the same as
> unauthorized, allowing for a graceful retry attempt.
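To make the --mem-per-cpu/MaxMemPerCPU interaction above concrete, a
worked example with illustrative values. With MaxMemPerCPU=4096
configured in slurm.conf, the request

    srun --mem-per-cpu=8192 --cpus-per-task=1 ./app

exceeds the per-CPU limit, so the CPUs per task are raised to 2
(8192 / 4096) and the effective mem-per-cpu drops to 4096, keeping the
allocation within MaxMemPerCPU while still giving the task the 8192 MB
it expected.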
We are pleased to announce the availability of Slurm version 24.05.5.
This release fixes a few potential crashes, several stepmgr bugs,
compatibility for sstat and sattach with newer version steps, and some
other minor bugs.
Downloads are available at https://www.schedmd.com/downloads.php .
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
> * Changes in Slurm 24.05.5
> ==========================
> -- Fix issue signaling cron jobs resulting in unintended requeues.
> -- Fix slurmctld memory leak in implementation of HealthCheckNodeState=CYCLE.
> -- job_container/tmpfs - Fix SLURM_CONF env variable not being properly set.
> -- sched/backfill - Fix job's time_limit being overwritten by time_min for job
> arrays in some situations.
> -- RoutePart - fix segfault from incorrect memory allocation when node doesn't
> exist in any partition.
> -- slurmctld - Fix crash when a job is evaluated for a reservation after
> removal of a dynamic node.
> -- gpu/nvml - Attempt loading libnvidia-ml.so.1 as a fallback for failure in
> loading libnvidia-ml.so.
> -- slurmrestd - Fix populating non-required object fields of objects as '{}' in
> JSON/YAML instead of 'null' causing compiled OpenAPI clients to reject
> the response to 'GET /slurm/v0.0.40/jobs' due to validation failure of
> '.jobs[].job_resources'.
> -- Fix sstat/sattach protocol errors for steps on higher version slurmd's
> (regressions since 20.11.0rc1 and 16.05.1rc1 respectively).
> -- slurmd - Avoid a crash when starting slurmd version 24.05 with
> SlurmdSpoolDir files that have been upgraded to a newer major version of
> Slurm. Log warnings instead.
> -- Fix race condition in stepmgr step completion handling.
> -- Fix slurmctld segfault with stepmgr and MpiParams when running a job array.
> -- Fix requeued jobs keeping their priority until the decay thread runs.
> -- slurmctld - Fix crash and possible split brain issue if the
> backup controller handles an scontrol reconfigure while in control
> before the primary resumes operation.
> -- Fix stepmgr not getting dynamic node addrs from the controller.
> -- stepmgr - avoid "Unexpected missing socket" errors.
> -- Fix `scontrol show steps` with dynamic stepmgr.
> -- Support IPv6 in configless mode.