Slurm version 23.11.4 is now available - slurm-announce

22 Feb 2024


We are pleased to announce the availability of Slurm version 23.11.4.
The 23.11.4 release includes a number of stability improvements and various
bug fixes. Notable changes: VSZ is no longer reported when using cgroup/v2
(the kernel does not provide this metric), a warning is now issued when
select/linear and topology/tree are used together, as this combination will
not be supported in the next major release, and a backwards compatibility
issue that caused jobs using --gpus to be rejected when submitted from 23.02
or 22.05 has been fixed.
Slurm can be downloaded from https://www.schedmd.com/downloads.php .
In addition, welcome to the updated slurm-announce list! We've made some 
mailing list adjustments in order to ensure compliance with newer 
anti-spam measures, and upgraded to Mailman3 as part of this process.
- Tim
-- 
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

> * Changes in Slurm 23.11.4
> ==========================
>  -- Fix a memory leak when updating partition nodes.
>  -- Don't leave a partition around if it fails to create with scontrol.
>  -- Fix segfault when creating partition with bad node list from scontrol.
>  -- Fix preserving partition nodes on bad node list update from scontrol.
>  -- Fix assertion in developer mode on a failed message unpack.
>  -- Fix repeat POWER_DOWN requests making the nodes available for ping.
>  -- Fix rebuilding job alias_list on restart when nodes are still powering up.
>  -- Fix INVALID nodes running health check.
>  -- Fix cloud/future nodes not setting addresses on invalid registration.
>  -- scrun - Remove the requirement to set the SCRUN_WORKING_DIR environment
>     variable. This was a regression in 23.11.
>  -- Add warning for using select/linear with topology/tree.
>     This combination will not be supported in the next major version.
>  -- Fix health check program not being run after first pass of all nodes when
>     using MaxNodeCount.
>  -- sacct - Set process exit code to one for all errors.
>  -- Add SlurmctldParameters=disable_triggers option.
>  -- Fix issue running steps when the allocation requested an exclusive
>     allocation along with shards.
>  -- Fix cleaning up the sleep process and the cgroup of the extern step if
>     slurm_spank_task_post_fork returns an error.
>  -- slurm_completion - Add missing --gres-flags= options
>     multiple-tasks-per-sharing and one-task-per-sharing.
>  -- scrun - Avoid race condition that could cause outbound network
>     communications to be incorrectly rejected with an incomplete packet error.
>  -- scrun - Gracefully handle the kernel reporting an invalid expected number
>     of incoming bytes for a connection, which caused incoming packet
>     corruption and resulted in the connection being closed.
>  -- srun - Return 1 when a step launch fails.
>  -- scrun - Avoid race condition that could cause deadlock during shutdown.
>  -- Fix scontrol listpids to work under dynamic node scenarios.
>  -- Add --tres-bind to --help and --usage output.
>  -- Add --gres-flags=allow-task-sharing to allow GPUs to still be accessible
>     among all tasks when binding GPUs to specific tasks.
>  -- Fix issue with CUDA_VISIBLE_DEVICES showing the same MIG device for all
>     tasks when using MIGs with --tres-per-task or --gpus-per-task.
>  -- slurmctld - Prevent a potential hang during shutdown/reconfigure if the
>     association cache thread was previously shut down.
>  -- scrun - Avoid race condition that could cause scrun to hang during
>     shutdown when connections have pending events.
>  -- scrun - Avoid excessive polling of connections during shutdown that could
>     needlessly cause 100% CPU usage on a thread.
>  -- sbcast - Use user identity from broadcast credential instead of looking it
>     up locally on the node.
>  -- scontrol - Remove "abort" option handling.
>  -- Fix an error message referring to the wrong RPC.
>  -- Fix memory leak on error when creating dynamic nodes.
>  -- Fix a slurmctld segfault when a cloud/dynamic node changes hostname on
>     registration.
>  -- Prevent a slurmctld deadlock if the gpu plugin fails to load when
>     creating a node.
>  -- Change a slurmctld fatal() to an error() when attempting to create a
>     dynamic node with a global autodetect set in gres.conf.
>  -- Fix leaving node records on error when creating nodes with scontrol.
>  -- scrun/sackd - Avoid race condition where shutdown could deadlock.
>  -- Fix a regression in 23.02.5 that caused pam_slurm_adopt to fail when
>     the user has multiple jobs on a node.
>  -- Add GLOB_SILENCE flag that silences the error message which will display if
>     an include directive attempts to use the "*" wildcard.
>  -- Fix jobs getting rejected when submitting with --gpus option from older
>     versions of job submission commands (23.02 and older).
>  -- cgroup/v2 - Return 0 for VSZ. Kernel cgroups do not provide this metric.
>  -- scrun - Avoid race condition where outbound RPCs could be corrupted.
>  -- scrun - Avoid race condition that could cause a crash while compiled in
>     debug mode.
>  -- gpu/rsmi - Disable gpu usage statistics when not using ROCM 6.0.0+
>  -- Fix stuck processes and incorrect environment when using --get-user-env.
>  -- Avoid segfault in the slurmdbd when TrackWCKey=no but you are still using
>     WCKeys.
>  -- Fix slurmctld segfault with TopologyParam=RoutePart and no partition
>     defined.
>  -- slurmctld - Fix missing --deadline handling for jobs not evaluated by the
>     schedulers (i.e. non-runnable, skipped for other reasons, etc.).
>  -- Demote some eio-related logs from error to verbose in user commands. These
>     are not generally actionable by the user and are easily generated by port
>     scanning a machine running srun.
>  -- Make sprio correctly print array tasks that have not yet been split out.
>  -- topology/block - Restrict the number of last-level blocks in any allocation.
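
The changelog above adds a warning for sites that combine select/linear with
topology/tree. For anyone who wants to check a configuration ahead of the
upgrade, the following is a minimal sketch only, not part of Slurm. It assumes
the combination is expressed through the SelectType and TopologyPlugin
parameters in slurm.conf, and that a simple key=value scan is good enough (it
ignores include directives and escaped "#" characters). The function names
parse_slurm_conf and uses_linear_with_tree are hypothetical helpers made up for
this illustration.

```python
def parse_slurm_conf(text):
    """Collect the first key=value pair of each non-comment line.

    Keys are lower-cased because slurm.conf parameter names are
    case-insensitive. This is a simplification: include directives and
    escaped '#' characters are not handled.
    """
    settings = {}
    for raw_line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def uses_linear_with_tree(text):
    """Return True if the text selects select/linear and topology/tree."""
    settings = parse_slurm_conf(text)
    select_type = settings.get("selecttype", "").lower()
    topology = settings.get("topologyplugin", "").lower()
    return select_type == "select/linear" and topology == "topology/tree"


if __name__ == "__main__":
    # Hypothetical configuration fragment used only for demonstration.
    sample = """
    # scheduling
    SelectType=select/linear
    TopologyPlugin=topology/tree
    """
    if uses_linear_with_tree(sample):
        print("select/linear with topology/tree: expect the new warning")
```

To check a real installation, the contents of the site's slurm.conf can be read
from disk and passed to uses_linear_with_tree() in place of the sample string.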