[slurm-users] Upgrade from 20.11.0 to Slurm version 22.05.6 ?

Thu Nov 10 22:12:36 UTC 2022

Is there a direct upgrade path from 20.11.0 to 22.05.6, or does it have to
be done in multiple steps?

Sid Young

On Fri, Nov 11, 2022 at 7:53 AM Marshall Garey <marshall at schedmd.com> wrote:

> We are pleased to announce the availability of Slurm version 22.05.6.
>
> This includes a fix to core selection for steps which could result in
> random task launch failures, alongside a number of other moderate
> severity issues.
>
> - Marshall
>
> --
> Marshall Garey
> Release Management, Support, and Development
> SchedMD LLC - Commercial Slurm Development and Support
>
> > * Changes in Slurm 22.05.6
> > ==========================
> >  -- Fix a partition's DisableRootJobs=no from preventing root jobs from working.
> >  -- Fix the number of allocated cpus for an auto-adjustment case in which the
> >     job requests --ntasks-per-node and --mem (per-node) but the limit is
> >     MaxMemPerCPU.
> >  -- Fix POWER_DOWN_FORCE request leaving node in completing state.
> >  -- Do not count magnetic reservation queue records towards backfill limits.
> >  -- Clarify error message when --send-libs=yes or BcastParameters=send_libs
> >     fails to identify shared library files, and avoid creating an empty
> >     "<filename>_libs" directory on the target filesystem.
> >  -- Fix missing CoreSpec on dynamic nodes upon slurmctld restart.
> >  -- Fix node state reporting when using specialized cores.
> >  -- Fix number of CPUs allocated if --cpus-per-gpu used.
> >  -- Add flag ignore_prefer_validation to not validate --prefer on a job.
> >  -- Fix salloc/sbatch SLURM_TASKS_PER_NODE output environment variable when the
> >     number of tasks is not requested.
> >  -- Permit using wildcard magic cookies with X11 forwarding.
> >  -- cgroup/v2 - Add check for swap when running OOM check after task
> >     termination.
> >  -- Fix deadlock caused by race condition when disabling power save with a
> >     reconfigure.
> >  -- Fix memory leak in the dbd when container is sent to the database.
> >  -- openapi/dbv0.0.38 - correct dbv0.0.38_tres_info.
> >  -- Fix node SuspendTime, SuspendTimeout, ResumeTimeout being updated after
> >     altering partition node lists with scontrol.
> >  -- jobcomp/elasticsearch - fix data_t memory leak after serialization.
> >  -- Fix issue where '*' wasn't accepted in gpu/cpu bind.
> >  -- Fix SLURM_GPUS_ON_NODE for shared GPU gres (MPS, shards).
> >  -- Add SLURM_SHARDS_ON_NODE environment variable for shards.
> >  -- Fix srun error with overcommit.
> >  -- Fix bug in core selection for the default cyclic distribution of tasks
> >     across sockets, that resulted in random task launch failures.
> >  -- Fix core selection for steps requesting multiple tasks per core when
> >     allocation contains more cores than required for step.
> >  -- gpu/nvml - Fix MIG minor number generation when GPU minor number
> >     (/dev/nvidia[minor_number]) and index (as seen in nvidia-smi) do not match.
> >  -- Fix accrue time underflow errors after slurmctld reconfig or restart.
> >  -- Suppress errant errors from prolog_complete about being unable to locate
> >     "node:(null)".
> >  -- Fix issue where shards were selected from multiple gpus and failed to
> >     allocate.
> >  -- Fix step cpu count calculation when using --ntasks-per-gpu=.
> >  -- Fix overflow problems when validating array index parameters in slurmctld
> >     and prevent a potential condition causing slurmctld to crash.
> >  -- Remove dependency on json-c in slurmctld when running with power saving.
> >     Only the new "SLURM_RESUME_FILE" support relies on this, and it will be
> >     disabled if json-c support is unavailable instead.
>
>
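The changelog above includes a fix to the SLURM_TASKS_PER_NODE output variable set by salloc/sbatch. Per the sbatch documentation, that value is a comma-separated list of task counts in SLURM_JOB_NODELIST order, where "(x#)" marks a count repeated on # consecutive nodes (e.g. "2(x3),1"). The following is a minimal Python sketch that expands such a string into one count per node. The function name expand_tasks_per_node is hypothetical and is not part of Slurm. The sketch also assumes the documented format and does not cover any other variants.

```python
import os
import re


def expand_tasks_per_node(value: str) -> list[int]:
    """Expand a SLURM_TASKS_PER_NODE-style string such as "2(x3),1" into
    one task count per node (hypothetical helper, assumes documented format)."""
    counts = []
    for field in value.split(","):
        # Each field is "N" or "N(xR)": N tasks on each of R consecutive nodes.
        match = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", field.strip())
        if match is None:
            raise ValueError(f"unexpected field: {field!r}")
        tasks = int(match.group(1))
        repeat = int(match.group(2)) if match.group(2) else 1
        counts.extend([tasks] * repeat)
    return counts


if __name__ == "__main__":
    # Outside a Slurm job, fall back to the documentation's example value.
    print(expand_tasks_per_node(os.environ.get("SLURM_TASKS_PER_NODE", "2(x3),1")))
```

With the example value "2(x3),1", the result is [2, 2, 2, 1]: two tasks on each of the first three nodes and one task on the fourth.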