We are pleased to announce the availability of Slurm version 23.11.4.
The 23.11.4 release includes a number of stability improvements and
various bug fixes. Some notable changes: VSZ is no longer reported
when using cgroup/v2 (the kernel does not provide it), a warning has
been added when select/linear and topology/tree are used together, as
this combination will not be supported in the next major release, and
a backwards compatibility issue that caused jobs using --gpus to be
rejected when submitted from 23.02 or 22.05 has been fixed.
Slurm can be downloaded from https://www.schedmd.com/downloads.php .
In addition, welcome to the updated slurm-announce list! We've made some
mailing list adjustments in order to ensure compliance with newer
anti-spam measures, and upgraded to Mailman3 as part of this process.
- Tim
--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support
> * Changes in Slurm 23.11.4
> ==========================
> -- Fix a memory leak when updating partition nodes.
> -- Don't leave a partition around if it fails to create with scontrol.
> -- Fix segfault when creating partition with bad node list from scontrol.
> -- Fix preserving partition nodes on bad node list update from scontrol.
> -- Fix assertion in developer mode on a failed message unpack.
> -- Fix repeat POWER_DOWN requests making the nodes available for ping.
> -- Fix rebuilding job alias_list on restart when nodes are still powering up.
> -- Fix INVALID nodes running health check.
> -- Fix cloud/future nodes not setting addresses on invalid registration.
> -- scrun - Remove the requirement to set the SCRUN_WORKING_DIR environment
> variable. This was a regression in 23.11.
> -- Add warning for using select/linear with topology/tree.
> This combination will not be supported in the next major version.
> -- Fix health check program not being run after first pass of all nodes when
> using MaxNodeCount.
> -- sacct - Set process exit code to one for all errors.
> -- Add SlurmctldParameters=disable_triggers option.
> -- Fix issue running steps when the allocation requested an exclusive
> allocation along with shards.
> -- Fix cleaning up the sleep process and the cgroup of the extern step if
> slurm_spank_task_post_fork returns an error.
> -- slurm_completion - Add missing --gres-flags= options
> multiple-tasks-per-sharing and one-task-per-sharing.
> -- scrun - Avoid race condition that could cause outbound network
> communications to be incorrectly rejected with an incomplete packet error.
> -- scrun - Gracefully handle kernel giving invalid expected number of incoming
> bytes for a connection causing incoming packet corruption resulting in
> connection getting closed.
> -- srun - Return 1 when a step launch fails.
> -- scrun - Avoid race condition that could cause deadlock during shutdown.
> -- Fix scontrol listpids to work under dynamic node scenarios.
> -- Add --tres-bind to --help and --usage output.
> -- Add --gres-flags=allow-task-sharing to allow GPUs to still be accessible
> among all tasks when binding GPUs to specific tasks.
> -- Fix issue with CUDA_VISIBLE_DEVICES showing the same MIG device for all
> tasks when using MIGs with --tres-per-task or --gpus-per-task.
> -- slurmctld - Prevent a potential hang during shutdown/reconfigure if the
> association cache thread was previously shut down.
> -- scrun - Avoid race condition that could cause scrun to hang during
> shutdown when connections have pending events.
> -- scrun - Avoid excessive polling of connections during shutdown that could
> needlessly cause 100% CPU usage on a thread.
> -- sbcast - Use user identity from broadcast credential instead of looking it
> up locally on the node.
> -- scontrol - Remove "abort" option handling.
> -- Fix an error message referring to the wrong RPC.
> -- Fix memory leak on error when creating dynamic nodes.
> -- Fix a slurmctld segfault when a cloud/dynamic node changes hostname on
> registration.
> -- Prevent a slurmctld deadlock if the gpu plugin fails to load when
> creating a node.
> -- Change a slurmctld fatal() to an error() when attempting to create a
> dynamic node with a global autodetect set in gres.conf.
> -- Fix leaving node records on error when creating nodes with scontrol.
> -- scrun/sackd - Avoid race condition where shutdown could deadlock.
> -- Fix a regression in 23.02.5 that caused pam_slurm_adopt to fail when
> the user has multiple jobs on a node.
> -- Add GLOB_SILENCE flag that silences the error message which will display if
> an include directive attempts to use the "*" wildcard.
> -- Fix jobs getting rejected when submitting with --gpus option from older
> versions of job submission commands (23.02 and older).
> -- cgroup/v2 - Return 0 for VSZ. Kernel cgroups do not provide this metric.
> -- scrun - Avoid race condition where outbound RPCs could be corrupted.
> -- scrun - Avoid race condition that could cause a crash while compiled in
> debug mode.
> -- gpu/rsmi - Disable gpu usage statistics when not using ROCM 6.0.0+
> -- Fix stuck processes and incorrect environment when using --get-user-env.
> -- Avoid segfault in the slurmdbd when TrackWCKey=no but you are still using
> WCKeys.
> -- Fix ctld segfault with TopologyParam=RoutePart and no partition defined.
> -- slurmctld - Fix missing --deadline handling for jobs not evaluated by the
> schedulers (i.e. non-runnable, skipped for other reasons, etc.).
> -- Demote some eio related logs from error to verbose in user commands. These
> are not generally actionable by the user and are easily generated by port
> scanning a machine running srun.
> -- Make sprio correctly print array tasks that have not yet been split out.
> -- topology/block - Restrict the number of last-level blocks in any allocation.
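
For sites acting on the select/linear and topology/tree warning, or
trying the new disable_triggers option, the relevant slurm.conf lines
might look like the sketch below. This is only an illustration under
assumptions: the move to select/cons_tres is one possible alternative
and is not prescribed by this announcement, and the rest of your
configuration will differ.

```
# slurm.conf fragment (illustrative sketch, not a complete configuration)

# Combination that logs a warning as of 23.11.4:
#   SelectType=select/linear
#   TopologyPlugin=topology/tree

# Assumption: one way to avoid the warning is to keep topology/tree and
# switch to a different select plugin such as select/cons_tres.
SelectType=select/cons_tres
TopologyPlugin=topology/tree

# Option added in 23.11.4, as named in the changelog above.
SlurmctldParameters=disable_triggers
```

Changing SelectType can affect running jobs, so check the slurm.conf
documentation for your version before applying a change like this.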