We are pleased to announce the availability of Slurm version 24.11.1.
This fixes a few possible crashes of the slurmctld and slurmrestd; a
regression in 24.11 which caused file transfers to a job with sbcast to
not join the job container namespace; mpi apps using Intel OPA, PSM2 and
OMPI 5.x when ran through srun; and various minor to moderate bugs.
Downloads are available at https://www.schedmd.com/downloads.php .
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
> * Changes in Slurm 24.11.1
> ==========================
> -- With client commands MIN_MEMORY will show mem_per_tres if specified.
> -- Fix errno message about bad constraint
> -- slurmctld - Fix crash and possible split brain issue if the
> backup controller handles an scontrol reconfigure while in control
> before the primary resumes operation.
> -- Fix stepmgr not getting dynamic node addrs from the controller
> -- stepmgr - avoid "Unexpected missing socket" errors.
> -- Fix `scontrol show steps` with dynamic stepmgr
> -- Deny jobs using the "R:" option of --signal if PreemptMode=OFF
> globally.
> -- Force jobs using the "R:" option of --signal to be preemptable
> by requeue or cancel only. If PreemptMode on the partition or QOS is off
> or suspend, the job will default to using PreemptMode=cancel.
> -- If --mem-per-cpu exceeds MaxMemPerCPU, the number of cpus per
> task will always be increased even if --cpus-per-task was specified. This
> is needed to ensure each task gets the expected amount of memory.
> -- Fix compilation issue on OpenSUSE Leap 15
> -- Fix jobs using more nodes than needed when not using -N
> -- Fix issue with allocation being allocated less resources
> than needed when using --gres-flags=enforce-binding.
> -- select/cons_tres - Fix errors with MaxCpusPerSocket partition
> limit. Used cpus/cores weren't counted properly, nor limiting free ones
> to avail, when the socket was partially allocated, or the job request
> went beyond this limit.
> -- Fix issue when jobs were preempted for licenses even if there
> were enough licenses available.
> -- Fix srun ntasks calculation inside an allocation when nodes are
> requested using a min-max range.
> -- Print correct number of digits for TmpDisk in sdiag.
> -- Fix a regression in 24.11 which caused file transfers to a job
> with sbcast to not join the job container namespace.
> -- data_parser/v0.0.40 - Prevent a segfault in the slurmrestd when
> dumping data with v0.0.40+complex data parser.
> -- Remove logic to force lowercase GRES names.
> -- data_parser/v0.0.42 - Prevent the association id from always
> being dumped as NULL when parsing in complex mode. Instead it will now
> dump the id. This affects the following endpoints:
> GET slurmdb/v0.0.42/association
> GET slurmdb/v0.0.42/associations
> GET slurmdb/v0.0.42/config
> -- Fixed a job requeuing issue that merged job entries into the
> same SLUID when all nodes in a job failed simultaneously.
> -- When a job completes, try to give idle nodes to reservations with
> the REPLACE flag before allowing them to be allocated to jobs.
> -- Avoid expensive lookup of all associations when dumping or
> parsing for v0.0.42 endpoints.
> -- Avoid expensive lookup of all associations when dumping or
> parsing for v0.0.41 endpoints.
> -- Avoid expensive lookup of all associations when dumping or
> parsing for v0.0.40 endpoints.
> -- Fix segfault when testing jobs against nodes with invalid gres.
> -- Fix performance regression while packing larger RPCs.
> -- Document the new mcs/label plugin.
> -- job_container/tmpfs - Fix Xauthoirty file being created
> outside the container when EntireStepInNS is enabled.
> -- job_container/tmpfs - Fix spank_task_post_fork not always
> running in the container when EntireStepInNS is enabled.
> -- Fix a job potentially getting stuck in CG on permissions
> errors while setting up X11 forwarding.
> -- Fix error on X11 shutdown if Xauthority file was not created.
> -- slurmctld - Fix memory or fd leak if an RPC is recieved that
> is not registered for processing.
> -- Inject OMPI_MCA_orte_precondition_transports when using PMIx. This fixes
> mpi apps using Intel OPA, PSM2 and OMPI 5.x when ran through srun.
> -- Don't skip the first partition_job_depth jobs per partition.
> -- Fix gres allocation issue after controller restart.
> -- Fix issue where jobs requesting cpus-per-gpu hang in queue.
> -- switch/hpe_slingshot - Treat HTTP status forbidden the same as
> unauthorized, allowing for a graceful retry attempt.