We are pleased to announce the availability of Slurm version 23.11.4.
The 23.11.4 release includes a number of stability improvements and various bug fixes. Some notable changes: VSZ is no longer reported when using cgroup/v2 (the kernel does not provide it), a warning has been added when select/linear and topology/tree are used together, as this combination will not be supported in the next major release, and a backwards compatibility issue that caused jobs using --gpus to be rejected when submitted from 23.02 or 22.05 has been fixed.
Slurm can be downloaded from https://www.schedmd.com/downloads.php .
-Tim
- Changes in Slurm 23.11.4
==========================
 -- Fix a memory leak when updating partition nodes.
 -- Don't leave a partition around if it fails to create with scontrol.
 -- Fix segfault when creating partition with bad node list from scontrol.
 -- Fix preserving partition nodes on bad node list update from scontrol.
 -- Fix assertion in developer mode on a failed message unpack.
 -- Fix repeat POWER_DOWN requests making the nodes available for ping.
 -- Fix rebuilding job alias_list on restart when nodes are still powering up.
 -- Fix INVALID nodes running health check.
 -- Fix cloud/future nodes not setting addresses on invalid registration.
 -- scrun - Remove the requirement to set the SCRUN_WORKING_DIR environment variable. This was a regression in 23.11.
 -- Add warning for using select/linear with topology/tree. This combination will not be supported in the next major version.
 -- Fix health check program not being run after first pass of all nodes when using MaxNodeCount.
 -- sacct - Set process exit code to one for all errors.
 -- Add SlurmctldParameters=disable_triggers option.
 -- Fix issue running steps when the allocation requested an exclusive allocation along with shards.
 -- Fix cleaning up the sleep process and the cgroup of the extern step if slurm_spank_task_post_fork returns an error.
 -- slurm_completion - Add missing --gres-flags= options multiple-tasks-per-sharing and one-task-per-sharing.
 -- scrun - Avoid race condition that could cause outbound network communications to be incorrectly rejected with an incomplete packet error.
 -- scrun - Gracefully handle kernel giving invalid expected number of incoming bytes for a connection causing incoming packet corruption resulting in connection getting closed.
 -- srun - Return 1 when a step launch fails.
 -- scrun - Avoid race condition that could cause deadlock during shutdown.
 -- Fix scontrol listpids to work under dynamic node scenarios.
 -- Add --tres-bind to --help and --usage output.
 -- Add --gres-flags=allow-task-sharing to allow GPUs to still be accessible among all tasks when binding GPUs to specific tasks.
 -- Fix issue with CUDA_VISIBLE_DEVICES showing the same MIG device for all tasks when using MIGs with --tres-per-task or --gpus-per-task.
 -- slurmctld - Prevent a potential hang during shutdown/reconfigure if the association cache thread was previously shut down.
 -- scrun - Avoid race condition that could cause scrun to hang during shutdown when connections have pending events.
 -- scrun - Avoid excessive polling of connections during shutdown that could needlessly cause 100% CPU usage on a thread.
 -- sbcast - Use user identity from broadcast credential instead of looking it up locally on the node.
 -- scontrol - Remove "abort" option handling.
 -- Fix an error message referring to the wrong RPC.
 -- Fix memory leak on error when creating dynamic nodes.
 -- Fix a slurmctld segfault when a cloud/dynamic node changes hostname on registration.
 -- Prevent a slurmctld deadlock if the gpu plugin fails to load when creating a node.
 -- Change a slurmctld fatal() to an error() when attempting to create a dynamic node with a global autodetect set in gres.conf.
 -- Fix leaving node records on error when creating nodes with scontrol.
 -- scrun/sackd - Avoid race condition where shutdown could deadlock.
 -- Fix a regression in 23.02.5 that caused pam_slurm_adopt to fail when the user has multiple jobs on a node.
 -- Add GLOB_SILENCE flag that silences the error message which will display if an include directive attempts to use the "*" wildcard.
 -- Fix jobs getting rejected when submitting with --gpus option from older versions of job submission commands (23.02 and older).
 -- cgroup/v2 - Return 0 for VSZ. Kernel cgroups do not provide this metric.
 -- scrun - Avoid race condition where outbound RPCs could be corrupted.
 -- scrun - Avoid race condition that could cause a crash while compiled in debug mode.
 -- gpu/rsmi - Disable gpu usage statistics when not using ROCM 6.0.0+
 -- Fix stuck processes and incorrect environment when using --get-user-env.
 -- Avoid segfault in the slurmdbd when TrackWCKey=no but you are still using WCKeys.
 -- Fix ctld segfault with TopologyParam=RoutePart and no partition defined.
 -- slurmctld - Fix missing --deadline handling for jobs not evaluated by the schedulers (i.e. non-runnable, skipped for other reasons, etc.).
 -- Demote some eio related logs from error to verbose in user commands. These are not generally actionable by the user and are easily generated by port scanning a machine running srun.
 -- Make sprio correctly print array tasks that have not yet been split out.
 -- topology/block - Restrict the number of last-level blocks in any allocation.
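
Sites wondering whether the new select/linear with topology/tree warning applies to them could check their configuration with something like the sketch below. It is not part of Slurm; the helper names parse_slurm_conf and uses_deprecated_combination are hypothetical, and the parsing is deliberately simplified (it assumes plain Key=Value pairs, treats keys case-insensitively, strips "#" comments, and does not follow include directives). It looks only at the SelectType and TopologyPlugin settings.

```python
import re


def parse_slurm_conf(text):
    """Collect Key=Value pairs from slurm.conf-style text.

    Simplified on purpose: keys are lower-cased, anything after '#'
    is treated as a comment, and include directives are not followed.
    """
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        # A single line may carry several whitespace-separated pairs.
        for key, value in re.findall(r"(\w+)=(\S+)", line):
            settings[key.lower()] = value
    return settings


def uses_deprecated_combination(text):
    """Return True if select/linear and topology/tree are both configured."""
    settings = parse_slurm_conf(text)
    select_type = settings.get("selecttype", "").lower()
    topology = settings.get("topologyplugin", "").lower()
    return select_type == "select/linear" and topology == "topology/tree"


if __name__ == "__main__":
    # Hypothetical excerpt, for illustration only.
    sample = """
    SelectType=select/linear
    TopologyPlugin=topology/tree
    """
    if uses_deprecated_combination(sample):
        print("select/linear with topology/tree is configured")
```

In practice the text would come from reading the site's own slurm.conf; the location of that file varies between installations, so it is left out here.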