[slurm-users] Slurm version 23.02.1 is now available

Tue Mar 28 20:58:51 UTC 2023

We are pleased to announce the availability of Slurm version 23.02.1.

This release includes several significant fixes to the upgrade process,
including a fix for remote licenses' allowed percentages being reset to 0
during the upgrade, and fixes for a few issues seen during rolling upgrades.

Slurm can be downloaded from https://www.schedmd.com/downloads.php.

- Marshall

-- 
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support

> * Changes in Slurm 23.02.1
> ==========================
>  -- job_container/tmpfs - cleanup job container even if namespace mount is
>     already unmounted.
>  -- When cluster specific tables are removed also remove the job_env_table
>     and job_script_table.
>  -- Fix the way bf_max_job_test is applied to job arrays in backfill.
>  -- data_parser/v0.0.39 - Avoid dumping -1 value or NULL when step's
>     consumed_energy is unset.
>  -- scontrol - Fix showing Array Job Steps.
>  -- scontrol - Fix showing Job HetStep.
>  -- openapi/dbv0.0.38 - Fix not displaying an error when updating QOS or
>     associations fails.
>  -- data_parser/v0.0.39 - Avoid crash while parsing composite structures.
>  -- sched/backfill - fix deleted planned node staying in planned node bitmap.
>  -- Fix nodes remaining as PLANNED after slurmctld save state recovery.
>  -- Fix parsing of cgroup.controllers file with a blank line at the end.
>  -- Add cgroup.conf EnableControllers option for cgroup/v2.
>  -- Get correct cgroup root to allow slurmd to run in containers like Docker.
>  -- Fix "(null)" cluster name in SLURM_WORKING_CLUSTER env.
>  -- slurmctld - add missing PrivateData=jobs check to step ContainerID lookup
>     requests originated from 'scontrol show step container-id=<id>' or certain
>     scrun operations when container state can't be directly queried.
>  -- Automatically sort the TaskPlugin list reverse-alphabetically. This
>     addresses an issue where cpu masks were reset if task/affinity was listed
>     before task/cgroup on cgroup/v2 systems with Linux kernel < 6.2.
>  -- Fix some failed terminate job requests from a 23.02 slurmctld to a 22.05 or
>     21.08 slurmd.
>  -- Fix compile issues on 32-bit systems.
>  -- Fix nodes un-draining after being drained due to unkillable step.
>  -- Fix remote licenses allowed percentages reset to 0 during upgrade.
>  -- sacct - Avoid truncating time strings when using SLURM_TIME_FORMAT with
>     the --parsable option.
>  -- data_parser/v0.0.39 - fix segfault when default qos is not set.
>  -- Fix regression in 22.05.0rc1 that broke Nodes=ALL in a NodeSet.
>  -- openapi/v0.0.39 - fix jobs submitted via slurmrestd being allocated fewer
>     CPUs than tasks when requesting multiple tasks.
>  -- Fix job not being scheduled on valid nodes and potentially being rejected
>     when using parentheses at the beginning of square brackets in a feature
>     request, for example: "feat1&[(feat2|feat3)]".
>  -- Fix a job being scheduled on nodes that do not match a feature request that
>     uses parentheses inside of brackets and requests additional features outside
>     of brackets, for example: "feat1&[feat2|(feat3|feat4)]".
>  -- Fix regression in 23.02.0rc1 which made --gres-flags=enforce-binding no
>     longer enforce optimal core-gpu job placement.
>  -- switch/hpe_slingshot - add option to disable VNI allocation per-job.
>  -- switch/hpe_slingshot - restrict CXI services to the requesting user.
>  -- switch/hpe_slingshot - Only output tcs once in SLINGSHOT_TCS env.
>  -- switch/hpe_slingshot - Fix updating LEs and ACs limits.
>  -- switch/hpe_slingshot - Use correct Max for EQs and CTs.
>  -- switch/hpe_slingshot - support configuring network options per-job.
>  -- switch/hpe_slingshot - retry destroying CXI service if necessary.
>  -- Fix memory leak caused by job preemption when licenses are configured.
>  -- mpi/pmix - Fix v5 to load correctly when libpmix.so isn't in the normal
>     lib path.
>  -- data_parser/v0.0.39 - fix regression where "memory_per_node" would be
>     rejected for job submission.
>  -- data_parser/v0.0.39 - fix regression where "memory_per_cpu" would be
>     rejected for job submission.
>  -- slurmctld - add an assert to check for magic number presence before deleting
>     a partition record and clear the magic afterwards to better diagnose
>     potential memory problems.
>  -- Clean up OCI containers task directories correctly.
>  -- slurm.spec - add "--with jwt" option.
>  -- scrun - Run under existing job when SLURM_JOB_ID is present.
>  -- Prevent a slurmstepd crash when the I/O subsystem has hung.
>  -- common/conmgr - fix memory leak of complete connection list.
>  -- data_parser/v0.0.39 - fix memory leak when parsing every field in a struct.
>  -- job_container/tmpfs - avoid printing extraneous error messages when running
>     a spank plugin that implements slurm_spank_job_prolog() or
>     slurm_spank_job_epilog().
>  -- Fix srun < 23.02 always getting an "exact" core allocation.
>  -- Prevent scontrol < 23.02 from setting MaxCPUsPerSocket to 0.
>  -- Add ScronParameters=explicit_scancel and corresponding scancel --cron
>     option.
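
As an illustration of the TaskPlugin entry above ("Automatically sort the
TaskPlugin list reverse-alphabetically"), here is a minimal sketch of what a
reverse-alphabetical sort does to a typical setting. This is not Slurm's
actual code; the helper name sort_task_plugins is hypothetical and the input
is assumed to be a plain comma-separated TaskPlugin value.

```python
def sort_task_plugins(task_plugin: str) -> str:
    """Return a comma-separated TaskPlugin value sorted reverse-alphabetically.

    Hypothetical helper for illustration only; Slurm does this internally.
    """
    # Split on commas and drop surrounding whitespace and empty entries.
    plugins = [p.strip() for p in task_plugin.split(",") if p.strip()]
    # Reverse-alphabetical order places task/cgroup ahead of task/affinity.
    return ",".join(sorted(plugins, reverse=True))


print(sort_task_plugins("task/affinity,task/cgroup"))
# prints: task/cgroup,task/affinity
```

With this ordering, task/cgroup ends up listed before task/affinity regardless
of how the two were written in the configuration, which is the ordering the
changelog entry describes as avoiding the cpu mask reset.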