[slurm-announce] Slurm version 23.02.4 is now available

Tim McMullan mcmullan at schedmd.com
Thu Jul 27 19:53:41 UTC 2023

We are pleased to announce the availability of Slurm version 23.02.4.

The 23.02.4 release includes a number of stability improvements and 
other bug fixes.  Notable fixes include the main scheduler loop not 
starting on the backup controller after a failover event, a slurmctld 
segfault when using AccountingStorageExternalHost, and an issue where 
steps could run indefinitely if slurmctld took longer than 
MessageTimeout to respond.

Slurm can be downloaded from https://www.schedmd.com/downloads.php .
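
For sites deciding whether to upgrade, a minimal sketch of checking whether
an installed version predates this release. It assumes the version string
looks like "slurm 23.02.3", as printed by commands such as "sinfo --version";
the function names are hypothetical and not part of Slurm.

```python
import re

# The release announced in this message.
FIXED_RELEASE = (23, 2, 4)


def parse_slurm_version(text):
    """Extract (major, minor, micro) from text such as 'slurm 23.02.4'.

    Assumption: the version appears as three dot-separated integers.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no Slurm version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def predates_fixed_release(text, fixed=FIXED_RELEASE):
    """Return True if the reported version is older than `fixed`."""
    return parse_slurm_version(text) < fixed
```

Tuples compare element by element, so 23.02.3 sorts before 23.02.4 and
23.11.0 sorts after it without any string comparison pitfalls.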


Tim McMullan
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support

> * Changes in Slurm 23.02.4
> ==========================
>  -- Fix sbatch return code when --wait is requested on a job array.
>  -- switch/hpe_slingshot - avoid segfault when running with old libcxi.
>  -- Avoid slurmctld segfault when specifying AccountingStorageExternalHost.
>  -- Fix collected GPUUtilization values for acct_gather_profile plugins.
>  -- Fix slurmrestd handling of job hold/release operations.
>  -- Make spank S_JOB_ARGV item value hold the requested command argv instead of
>     the srun --bcast value when --bcast requested (only in local context).
>  -- Fix step running indefinitely when slurmctld takes more than MessageTimeout
>     to respond. Now, slurmctld will cancel the step when detected, preventing
>     following steps from getting stuck waiting for resources to be released.
>  -- Fix regression to make job_desc.min_cpus accurate again in job_submit when
>     requesting a job with --ntasks-per-node.
>  -- scontrol - Permit changes to StdErr and StdIn for pending jobs.
>  -- scontrol - Reset std{err,in,out} when set to empty string.
>  -- slurmrestd - mark environment as a required field for job submission
>     descriptions.
>  -- slurmrestd - avoid dumping null in OpenAPI schema required fields.
>  -- data_parser/v0.0.39 - avoid rejecting valid memory_per_node formatted as
>     dictionary provided with a job description.
>  -- data_parser/v0.0.39 - avoid rejecting valid memory_per_cpu formatted as
>     dictionary provided with a job description.
>  -- slurmrestd - Return HTTP error code 404 when job query fails.
>  -- slurmrestd - Add return schema to error response to job and license query.
>  -- Fix handling of ArrayTaskThrottle in backfill.
>  -- Fix regression in 23.02.2 when checking gres state on slurmctld startup or
>     reconfigure. Gres changes in the configuration were not updated on slurmctld
>     startup. On startup or reconfigure, these messages were present in the log:
>     "error: Attempt to change gres/gpu Count".
>  -- Fix potential double count of gres when dealing with limits.
>  -- switch/hpe_slingshot - support alternate traffic class names with "TC_"
>     prefix.
>  -- scrontab - Fix cutting off the final character of quoted variables.
>  -- Fix slurmstepd segfault when ContainerPath is not set in oci.conf.
>  -- Change the log message warning for rate limited users from debug to verbose.
>  -- Fix an issue where jobs requesting licenses were incorrectly rejected.
>  -- smail - Fix issues where e-mails at job completion were not being sent.
>  -- scontrol/slurmctld - fix comma parsing when updating a reservation's nodes.
>  -- cgroup/v2 - Avoid capturing log output for ebpf when constraining devices,
>     as this can lead to inadvertent failure if the log buffer is too small.
>  -- Fix --gpu-bind=single binding tasks to wrong gpus, leading to some gpus
>     having more tasks than they should and other gpus being unused.
>  -- Fix main scheduler loop not starting after failover to backup controller.
>  -- Add error message when attempting to use sattach on batch or extern steps.
>  -- Fix regression in 23.02 that causes slurmstepd to crash when srun requests
>     more than TreeWidth nodes in a step and uses the pmi2 or pmix plugin.
>  -- Reject job ArrayTaskThrottle update requests from unprivileged users.
>  -- data_parser/v0.0.39 - populate description fields of property objects in
>     generated OpenAPI specifications where defined.
>  -- slurmstepd - Avoid segfault caused by ContainerPath not being terminated by
>     '/' in oci.conf.
>  -- data_parser/v0.0.39 - Change v0.0.39_job_info response to tag exit_code
>     field as being complex instead of only an unsigned integer.
>  -- job_container/tmpfs - Fix %h and %n substitution in BasePath where %h was
>     substituted as the NodeName instead of the hostname, and %n was substituted
>     as an empty string.
>  -- Fix regression where --cpu-bind=verbose would override TaskPluginParam.
>  -- scancel - Fix --clusters/-M for federations. Only filtered jobs (e.g. -A,
>     -u, -p, etc.) from the specified clusters will be canceled, rather than all
>     jobs in the federation. Specific jobids will still be routed to the origin
>     cluster for cancellation.
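
The job_container/tmpfs entry above describes the corrected BasePath
substitution: %h expands to the hostname and %n to the NodeName. The sketch
below only illustrates that behavior; it is not Slurm's implementation, and
the function name, the "/scratch" path, and the host and node names are
hypothetical.

```python
import re


def expand_base_path(base_path, hostname, node_name):
    """Expand %h (hostname) and %n (NodeName) in a BasePath-style string.

    Illustration of the behavior described in the changelog entry only.
    Both tokens are replaced in a single pass, so text inserted for one
    token is never rescanned for the other.
    """
    values = {"%h": hostname, "%n": node_name}
    return re.sub(r"%[hn]", lambda m: values[m.group(0)], base_path)


if __name__ == "__main__":
    # Hypothetical node whose NodeName differs from its hostname.
    print(expand_base_path("/scratch/%h/%n", "host01.example.com", "node01"))
```

Before the fix, per the entry, %h was filled with the NodeName and %n was
left empty, so the two tokens could not be used to tell hosts and nodes
apart in the resulting path.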
