Slurm version 24.11.4 is now available - slurm-users

8 Apr 2025


We are pleased to announce the availability of Slurm version 24.11.4.
This release fixes a variety of major to minor severity bugs. Some edge 
cases that caused jobs to pend forever are fixed. Notable stability 
issues that are fixed include:
* slurmctld crashing upon receiving a certain heterogeneous job submission.
* slurmd crashing after a communications failure with a slurmstepd.
* A variety of race conditions related to receiving and processing 
connections, including one that resulted in the slurmd ignoring new RPC 
connections.
Downloads are available at https://www.schedmd.com/downloads.php .
-- 
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support

>  -- slurmctld,slurmrestd - Avoid possible race condition that could have caused
>     process to crash when listener socket was closed while accepting a new
>     connection.
>  -- slurmrestd - Avoid race condition that could have resulted in the
>     address logged for a UNIX socket being incorrect.
>  -- slurmrestd - Fix parameters in OpenAPI specification for the following
>     endpoints to have "job_id" field:
>     GET /slurm/v0.0.40/jobs/state/
>     GET /slurm/v0.0.41/jobs/state/
>     GET /slurm/v0.0.42/jobs/state/
>     GET /slurm/v0.0.43/jobs/state/
>  -- slurmd - Fix tracking of thread counts that could cause incoming
>     connections to be ignored after burst of simultaneous incoming connections
>     that trigger delayed response logic.
>  -- Stepmgr - Avoid unnecessary SRUN_TIMEOUT forwarding to stepmgr.
>  -- Fix jobs being scheduled on higher weighted powered down nodes.
>  -- Fix how backfill scheduler filters nodes from the available nodes based on
>     exclusive user and mcs_label requirements.
>  -- acct_gather_energy/{gpu,ipmi} - Fix potential energy consumption adjustment
>     calculation underflow.
>  -- acct_gather_energy/ipmi - Fix regression introduced in 24.05.5 (which
>     introduced the new way of preserving energy measurements through slurmd
>     restarts) when EnergyIPMICalcAdjustment=yes.
>  -- Prevent slurmctld deadlock in the assoc mgr.
>  -- Fix memory leak when RestrictedCoresPerGPU is enabled.
>  -- Fix preemptor jobs not entering execution due to wrong calculation of
>     accounting policy limits.
>  -- Fix certain job requests that were incorrectly denied with node
>     configuration unavailable error.
>  -- slurmd - Avoid crash when slurmd has a communications failure with
>     slurmstepd.
>  -- Fix memory leak when parsing yaml input.
>  -- Prevent slurmctld from showing error message about PreemptMode=GANG being a
>     cluster-wide option for `scontrol update part` calls that don't attempt to
>     modify partition PreemptMode.
>  -- Fix setting GANG preemption on partition when updating PreemptMode with
>     scontrol.
>  -- Fix CoreSpec and MemSpec limits not being removed from previously
>     configured slurmd.
>  -- Avoid race condition that could lead to a deadlock when slurmd, slurmstepd,
>     slurmctld, slurmrestd or sackd have a fatal event.
>  -- Fix jobs using --ntasks-per-node and --mem pending forever when the
>     requested mem divided by the number of cpus surpasses the configured
>     MaxMemPerCPU.
>  -- slurmd - Fix address logged upon new incoming RPC connection from "INVALID"
>     to IP address.
>  -- Fix memory leak when retrieving reservations. This affects scontrol, sinfo,
>     sview, and the following slurmrestd endpoints:
>     'GET /slurm/{any_data_parser}/reservation/{reservation_name}'
>     'GET /slurm/{any_data_parser}/reservations'
>  -- Log warning instead of debugflags=conmgr gated log when deferring new
>     incoming connections when number of active connections exceeds
>     conmgr_max_connections.
>  -- Avoid race condition that could result in worker thread pool not activating
>     all threads at once after a reconfigure resulting in lower utilization of
>     available CPU threads until enough internal activity wakes up all threads
>     in the worker pool.
>  -- Avoid theoretical race condition that could result in new incoming RPC
>     socket connections being ignored after reconfigure.
>  -- slurmd - Avoid race condition that could result in a state where new
>     incoming RPC connections will always be ignored.
>  -- Add ReconfigFlags=KeepNodeStateFuture to restore saved FUTURE node state on
>     restart and reconfig instead of reverting to FUTURE state. This will be
>     made the default in 25.05.
>  -- Fix case where hetjob submit would cause slurmctld to crash.
>  -- Fix jobs using --cpus-per-gpu and --mem pending forever when the
>     requested mem divided by the number of cpus surpasses the configured
>     MaxMemPerCPU.
>  -- Enforce that jobs using --mem and several --*-per-* options do not violate
>     the MaxMemPerCPU in place.
>  -- slurmctld - Fix use-cases of jobs incorrectly pending held when --prefer
>     features are not initially satisfied.
>  -- slurmctld - Fix jobs incorrectly held when --prefer not satisfied in some
>     use-cases.
>  -- Ensure RestrictedCoresPerGPU and CoreSpecCount don't overlap.
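
As a rough illustration of the MaxMemPerCPU entries in the list above, the
sketch below computes the per-CPU memory implied by a --mem request and
compares it against a MaxMemPerCPU limit. The function names, the example
values, and the plain division are assumptions made for illustration only;
this is not Slurm's scheduling logic, just the arithmetic the entries
describe.

```python
def mem_per_cpu_mb(mem_mb: int, cpus: int) -> float:
    """Per-CPU memory implied by a --mem request (hypothetical helper)."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return mem_mb / cpus


def exceeds_max_mem_per_cpu(mem_mb: int, cpus: int, max_mem_per_cpu_mb: int) -> bool:
    """True when the implied per-CPU memory is above the configured limit."""
    return mem_per_cpu_mb(mem_mb, cpus) > max_mem_per_cpu_mb


# Assumed example values: --mem=16000 (MB) and MaxMemPerCPU=2000 (MB).
print(exceeds_max_mem_per_cpu(16000, 4, 2000))  # 16000 / 4 = 4000, above the limit
print(exceeds_max_mem_per_cpu(16000, 8, 2000))  # 16000 / 8 = 2000, not above the limit
```

A request in the first situation is the kind that, per the entries above,
could previously keep pending forever when combined with --ntasks-per-node
or --cpus-per-gpu.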