Slurm version 24.11.4 is now available
We are pleased to announce the availability of Slurm version 24.11.4.

This release fixes a variety of major to minor severity bugs, including some edge cases that caused jobs to pend forever. Notable stability issues that are fixed include:

* slurmctld crashing upon receiving a certain heterogeneous job submission.
* slurmd crashing after a communications failure with a slurmstepd.
* A variety of race conditions related to receiving and processing connections, including one that resulted in slurmd ignoring new RPC connections.

Downloads are available at https://www.schedmd.com/downloads.php

-- Marshall Garey

Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
-- slurmctld,slurmrestd - Avoid possible race condition that could have caused the process to crash when the listener socket was closed while accepting a new connection.
-- slurmrestd - Avoid race condition that could have resulted in the address logged for a UNIX socket being incorrect.
-- slurmrestd - Fix parameters in the OpenAPI specification for the following endpoints to have a "job_id" field:
     GET /slurm/v0.0.40/jobs/state/
     GET /slurm/v0.0.41/jobs/state/
     GET /slurm/v0.0.42/jobs/state/
     GET /slurm/v0.0.43/jobs/state/
-- slurmd - Fix tracking of thread counts that could cause incoming connections to be ignored after a burst of simultaneous incoming connections triggers the delayed response logic.
-- Stepmgr - Avoid unnecessary SRUN_TIMEOUT forwarding to stepmgr.
-- Fix jobs being scheduled on higher weighted powered-down nodes.
-- Fix how the backfill scheduler filters nodes from the available nodes based on exclusive user and mcs_label requirements.
-- acct_gather_energy/{gpu,ipmi} - Fix potential energy consumption adjustment calculation underflow.
-- acct_gather_energy/ipmi - Fix regression introduced in 24.05.5 (which introduced the new way of preserving energy measurements through slurmd restarts) when EnergyIPMICalcAdjustment=yes.
-- Prevent slurmctld deadlock in the assoc mgr.
-- Fix memory leak when RestrictedCoresPerGPU is enabled.
-- Fix preemptor jobs not entering execution due to a wrong calculation of accounting policy limits.
-- Fix certain job requests that were incorrectly denied with a "node configuration unavailable" error.
-- slurmd - Avoid crash when slurmd has a communications failure with slurmstepd.
-- Fix memory leak when parsing YAML input.
-- Prevent slurmctld from showing an error message about PreemptMode=GANG being a cluster-wide option for `scontrol update part` calls that do not attempt to modify the partition PreemptMode.
-- Fix setting GANG preemption on a partition when updating PreemptMode with scontrol.
-- Fix CoreSpec and MemSpec limits not being removed from a previously configured slurmd.
-- Avoid race condition that could lead to a deadlock when slurmd, slurmstepd, slurmctld, slurmrestd, or sackd have a fatal event.
-- Fix jobs using --ntasks-per-node and --mem pending forever when the requested memory divided by the number of CPUs surpasses the configured MaxMemPerCPU.
-- slurmd - Fix address logged upon a new incoming RPC connection from "INVALID" to the IP address.
-- Fix memory leak when retrieving reservations. This affects scontrol, sinfo, sview, and the following slurmrestd endpoints:
     'GET /slurm/{any_data_parser}/reservation/{reservation_name}'
     'GET /slurm/{any_data_parser}/reservations'
-- Log a warning instead of a DebugFlags=conmgr gated log when deferring new incoming connections because the number of active connections exceeds conmgr_max_connections.
-- Avoid race condition that could result in the worker thread pool not activating all threads at once after a reconfigure, resulting in lower utilization of available CPU threads until enough internal activity wakes up all threads in the worker pool.
-- Avoid theoretical race condition that could result in new incoming RPC socket connections being ignored after a reconfigure.
-- slurmd - Avoid race condition that could result in a state where new incoming RPC connections will always be ignored.
-- Add ReconfigFlags=KeepNodeStateFuture to restore saved FUTURE node state on restart and reconfigure instead of reverting to FUTURE state. This will be made the default in 25.05.
-- Fix case where a hetjob submission would cause slurmctld to crash.
-- Fix jobs using --cpus-per-gpu and --mem pending forever when the requested memory divided by the number of CPUs surpasses the configured MaxMemPerCPU.
-- Enforce that jobs using --mem and several --*-per-* options do not violate the MaxMemPerCPU in place.
-- slurmctld - Fix use-cases of jobs incorrectly pending held when --prefer features are not initially satisfied.
-- slurmctld - Fix jobs incorrectly held when --prefer is not satisfied in some use-cases.
-- Ensure RestrictedCoresPerGPU and CoreSpecCount don't overlap.
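The --ntasks-per-node/--mem and --cpus-per-gpu/--mem entries above involve the same arithmetic: when a job's total memory request divided by its CPU count exceeds MaxMemPerCPU, Slurm grows the job's CPU count rather than rejecting it, and a mismatch in that calculation could leave the job pending forever. A minimal sketch of the per-CPU check, assuming megabyte units (the helper name is illustrative, not Slurm source code):

```python
import math

def cpus_needed(mem_mb: int, cpus_requested: int, max_mem_per_cpu_mb: int) -> int:
    """Illustrative sketch: return the CPU count a job effectively needs
    so that its per-CPU memory stays within MaxMemPerCPU. Not Slurm
    source code; units are MB for simplicity."""
    if mem_mb / cpus_requested <= max_mem_per_cpu_mb:
        return cpus_requested
    # Per-CPU memory exceeds the limit: raise the CPU count instead.
    return math.ceil(mem_mb / max_mem_per_cpu_mb)

# e.g. --mem=16000 with 2 CPUs under MaxMemPerCPU=4000 effectively needs 4 CPUs
print(cpus_needed(16000, 2, 4000))
```

The fixed bugs were cases where this implied CPU increase interacted badly with other per-task or per-GPU options, so the accounting limits were evaluated against the wrong CPU count.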