[slurm-users] Upgrade from 20.11.0 to Slurm version 22.05.6 ?

Ryan Novosielski novosirj at rutgers.edu
Fri Nov 11 00:05:04 UTC 2022


We basically always do this. Just be mindful of how long it takes to upgrade your database (if you have the ability to do a dry run, you might want to do that; see the sketch below). That’s true of any upgrade, though.

If you have to skip more than one version, you’ll have to upgrade in stages.
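
One rough way to do that dry run (a sketch only, not an official SchedMD procedure; the database names, the dump path, and the assumption that a test slurmdbd.conf points at the copied database are all illustrative) is to restore a dump of the accounting database on a test host and time how long the new slurmdbd takes to convert it:

    #!/usr/bin/env python3
    # Sketch: time a slurmdbd schema conversion against a COPY of the
    # accounting database on a test host.  "slurm_acct_db",
    # "slurm_acct_db_test" and the dump path are placeholder names; the
    # test slurmdbd.conf must point at the test database.
    import subprocess, time

    # 1. Dump the production accounting DB and load it into the scratch DB.
    subprocess.run("mysqldump slurm_acct_db > /tmp/slurm_acct_db.sql",
                   shell=True, check=True)
    subprocess.run("mysql slurm_acct_db_test < /tmp/slurm_acct_db.sql",
                   shell=True, check=True)

    # 2. Run the NEW (22.05.6) slurmdbd in the foreground with verbose
    #    logging and watch its output for the table conversion messages.
    #    The daemon keeps running after the conversion finishes, so stop it
    #    (Ctrl-C) once the conversion messages stop, and note the elapsed
    #    time as a rough estimate for the production upgrade window.
    start = time.time()
    proc = subprocess.Popen(["slurmdbd", "-D", "-vvv"])
    try:
        proc.wait()
    except KeyboardInterrupt:
        proc.terminate()
    print("slurmdbd ran for about %.0f seconds" % (time.time() - start))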

On Nov 10, 2022, at 7:00 PM, Michael Gutteridge <michael.gutteridge at gmail.com> wrote:

Theoretically I think you should be able to.  Slurm should upgrade from the previous two major releases (see https://slurm.schedmd.com/quickstart_admin.html#upgrade:~:text=Slurm%20permits%20upgrades%20to%20a%20new%20major%20release%20from%20the%20past%20two%20major%20releases%2C), and I think that should include 20.11 (20.11 -> 21.08 -> 22.05).  Not something I've done, though.
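
A quick way to double-check what's actually installed before deciding (just a sketch; it assumes sinfo is on the PATH and that 22.05 is the target release):

    #!/usr/bin/env python3
    # Sketch: check whether the installed Slurm major release is one that
    # 22.05 documents as a supported direct-upgrade source (20.11 or 21.08).
    import subprocess

    SUPPORTED_FROM = {"20.11", "21.08", "22.05"}

    out = subprocess.run(["sinfo", "--version"], capture_output=True,
                         text=True, check=True)
    installed = out.stdout.split()[1]          # "slurm 20.11.0" -> "20.11.0"
    major = ".".join(installed.split(".")[:2])

    if major in SUPPORTED_FROM:
        print(f"{installed}: a direct upgrade to 22.05.x should be supported")
    else:
        print(f"{installed}: more than two majors behind; upgrade in stages")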

 - Michael


On Thu, Nov 10, 2022 at 2:15 PM Sid Young <sid.young at gmail.com> wrote:
Is there a direct upgrade path from 20.11.0 to 22.05.6, or does it have to be done in multiple steps?


Sid Young



On Fri, Nov 11, 2022 at 7:53 AM Marshall Garey <marshall at schedmd.com> wrote:
We are pleased to announce the availability of Slurm version 22.05.6.

This includes a fix to core selection for job steps which could result in
random task launch failures, alongside a number of other moderate-severity
issues.

- Marshall

--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support

> * Changes in Slurm 22.05.6
> ==========================
>  -- Fix a partition's DisableRootJobs=no from preventing root jobs from working.
>  -- Fix the number of allocated cpus for an auto-adjustment case in which the
>     job requests --ntasks-per-node and --mem (per-node) but the limit is
>     MaxMemPerCPU.
>  -- Fix POWER_DOWN_FORCE request leaving node in completing state.
>  -- Do not count magnetic reservation queue records towards backfill limits.
>  -- Clarify error message when --send-libs=yes or BcastParameters=send_libs
>     fails to identify shared library files, and avoid creating an empty
>     "<filename>_libs" directory on the target filesystem.
>  -- Fix missing CoreSpec on dynamic nodes upon slurmctld restart.
>  -- Fix node state reporting when using specialized cores.
>  -- Fix number of CPUs allocated if --cpus-per-gpu used.
>  -- Add flag ignore_prefer_validation to not validate --prefer on a job.
>  -- Fix salloc/sbatch SLURM_TASKS_PER_NODE output environment variable when the
>     number of tasks is not requested.
>  -- Permit using wildcard magic cookies with X11 forwarding.
>  -- cgroup/v2 - Add check for swap when running OOM check after task
>     termination.
>  -- Fix deadlock caused by race condition when disabling power save with a
>     reconfigure.
>  -- Fix memory leak in the dbd when container is sent to the database.
>  -- openapi/dbv0.0.38 - correct dbv0.0.38_tres_info.
>  -- Fix node SuspendTime, SuspendTimeout, ResumeTimeout being updated after
>     altering partition node lists with scontrol.
>  -- jobcomp/elasticsearch - fix data_t memory leak after serialization.
>  -- Fix issue where '*' wasn't accepted in gpu/cpu bind.
>  -- Fix SLURM_GPUS_ON_NODE for shared GPU gres (MPS, shards).
>  -- Add SLURM_SHARDS_ON_NODE environment variable for shards.
>  -- Fix srun error with overcommit.
>  -- Fix bug in core selection for the default cyclic distribution of tasks
>     across sockets, that resulted in random task launch failures.
>  -- Fix core selection for steps requesting multiple tasks per core when
>     allocation contains more cores than required for step.
>  -- gpu/nvml - Fix MIG minor number generation when GPU minor number
>     (/dev/nvidia[minor_number]) and index (as seen in nvidia-smi) do not match.
>  -- Fix accrue time underflow errors after slurmctld reconfig or restart.
>  -- Suppress errant errors from prolog_complete about being unable to locate
>     "node:(null)".
>  -- Fix issue where shards were selected from multiple gpus and failed to
>     allocate.
>  -- Fix step cpu count calculation when using --ntasks-per-gpu=.
>  -- Fix overflow problems when validating array index parameters in slurmctld
>     and prevent a potential condition causing slurmctld to crash.
>  -- Remove dependency on json-c in slurmctld when running with power saving.
>     Only the new "SLURM_RESUME_FILE" support relies on this, and it will be
>     disabled if json-c support is unavailable instead.

