[slurm-users] Slurm versions 18.08.1 and 17.11.10 are now available
Tim Wickberg
tim at schedmd.com
Thu Oct 4 16:11:37 MDT 2018
We are pleased to announce the availability of Slurm versions 18.08.1
and 17.11.10.
This includes an extensive set of fixes made since 18.08.0 was released
at the end of August, and for 17.11.10 since 17.11.9 was released at the
start of August.
Please note that the 17.11.10 release is expected to be the last
maintenance release of that series (barring any critical security
issues) as our support team has shifted their attention to the 18.08
release. Also note that support for 17.02 ended in August; SchedMD
customers are encouraged to upgrade to a supported major release (18.08
or 17.11) at their earliest convenience.
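For sites that track many clusters, the support statement above can be checked mechanically. The following is only a minimal sketch, not anything shipped with Slurm: the names SUPPORTED_SERIES, major_series and is_supported are hypothetical, and the set of supported series is simply the two named in this announcement (18.08 and 17.11).

```python
from typing import Tuple

# Assumption: the supported series are the two named in this announcement.
SUPPORTED_SERIES = {(18, 8), (17, 11)}


def major_series(version: str) -> Tuple[int, int]:
    """Return the major series of a version string, e.g. '17.11.10' -> (17, 11)."""
    parts = version.strip().split(".")
    if len(parts) < 2:
        raise ValueError("expected a version like '18.08.1', got %r" % version)
    # int() drops the leading zero, so '08' becomes 8.
    return int(parts[0]), int(parts[1])


def is_supported(version: str) -> bool:
    """True if the version belongs to one of the supported major series."""
    return major_series(version) in SUPPORTED_SERIES


if __name__ == "__main__":
    for v in ("18.08.1", "17.11.10", "17.02.11"):
        status = "supported" if is_supported(v) else "unsupported"
        print(v, status)
```

The version string passed in would typically be taken from whatever a site already records for each cluster; how it is obtained is left out of the sketch.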
Slurm can be downloaded from https://www.schedmd.com/downloads.php .
- Tim
--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support
> * Changes in Slurm 18.08.1
> ==========================
> -- Remove commented-out parts of man pages related to cons_tres work in 19.05,
> as these were showing up on the web version due to a syntax error.
> -- Prevent slurmctld performance issues in main background loop if multiple
> backup controllers are unavailable.
> -- Add missing user read association lock in burst_buffer/cray during init().
> -- Fix incorrect spacing for PartitionName lines in 'scontrol write config'.
> -- Fix creation of step hwloc xml file for after cpuset cgroup has been
> created.
> -- Add userspace as a valid default governor.
> -- Add timers to group_cache_lookup so if going slow advise
> LaunchParameters=send_gids.
> -- Fix SLURM_STEP_GRES=none to work correctly.
> -- Fix potential memory leak when a failure happens unpacking a ctld_multi_msg.
> -- Fix potential double free when a failure happens when unpacking a
> node_registration_status_msg.
> -- Fix sacctmgr show runaways.
> -- Removed non-POSIX append operator from configure script for non-bash
> support.
> -- Fix sacct to not print huge reserve times when the job was never eligible.
> -- burst_buffer/cray - Add missing locks around assoc_mgr when timing out a
> burst buffer.
> -- burst_buffer/cray - Update burst buffers when an association or qos
> is removed from the system.
> -- Remove documentation for deprecated Cray/ALPS systems. Please switch to
> Native Cray mode instead.
> -- Completely copy features when copying the list in the slurmctld.
> -- PMIX - Fix issue with packing processes when using an arbitrary task
> distribution.
> -- Fix hostlists to be able to handle nodenames with '-' in them surrounded
> by integers.
> -- Added sort option to sprio output.
> -- Fix correct job CPU count allocated.
> -- Fix sacctmgr setting GrpJobs limit when setting GrpJobsAccrue limit.
> -- Change the defaults to MemLimitEnforce=no and NoOverMemoryKill
> (See RELEASE_NOTES).
> -- Prevent abort when using Cray node features plugin on non-knl.
> -- Add ability to reboot down nodes with scontrol reboot_nodes.
> -- Protect against sending to the slurmdbd if the connection has gone away.
> -- Fix invalid read when not using backup slurmctlds.
> -- Prevent acct coordinators from changing default acct on add user.
> -- Don't allow scontrol top to modify job priorities when priority == 1.
> -- slurmsmwd - change parsing code to handle systems with the svid or inst
> fields set in xtconsumer output.
> -- Fix infinite loop in slurmctld if GRES is specified without a count.
> -- sacct: Print error when unknown arguments are found.
> -- Fix checking missing return codes when unpacking structures.
> -- Fix slurm.spec-legacy including slurmsmwd
> -- More explicit error message when cgroup oom-kill events detected.
> -- When updating an association and are unable to find parent association
> initialize old fairshare association pointer correctly.
> -- Wrap slurm_cond_signal() calls with mutexes where needed.
> -- Fix correct timeout with resends in slurm_send_only_node_msg.
> -- Fix pam_slurm_adopt to honor action_adopt_failure.
> -- Have the slurmd recreate the hwloc xml file for the full system on restart.
> -- sdiag - correct the units for the gettimeofday() stat to microseconds.
> -- Set SLURM_CLUSTER_NAME environment variable in MailProg to the ClusterName.
> -- smail - use SLURM_CLUSTER_NAME environment variable.
> -- job_submit/lua - expose argc/argv options through lua interface.
> -- slurmdbd - prevent false-positive warning about innodb settings having
> been set too low if they're actually set over 2GB.
>
> * Changes in Slurm 17.11.10
> ===========================
> -- Move priority_sort_part_tier from slurmctld to libslurm to make it possible
> to run the regression tests 24.* without changing that code since it links
> directly to the priority plugin where that function isn't defined.
> -- Fix issue where job time limits can increase to max walltime when updating
> a job with scontrol.
> -- Fix invalid protocol_version manipulation on big endian platforms causing
> srun and sattach to fail.
> -- Fix for QOS, Reservation and Alias env variables in srun.
> -- mpi/pmi2 - Backport 6a702158b49c4 from 18.08 to avoid dangerous detached
> thread.
> -- When allowing heterogeneous steps make sure we copy all the options to
> avoid copying strings that may be overwritten.
> -- Print correctly when sh5util finds an empty file.
> -- Fix sh5util to not seg fault on exit.
> -- Fix sh5util to check correctly for H5free_memory.
> -- Adjust OOM monitoring function in task/cgroup to prevent problems in
> regression suite from leaked file descriptors.
> -- Fix issue with gres when defined with a type and no count
> (i.e. gres=gpu/tesla) it would get a count of 0.
> -- Allow sstat to talk to slurmd's that are new in protocol version.
> -- Permit database names over 33 characters in accounting_storage/mysql.
> -- Fix negative values when profiling.
> -- Fix srun segfault caused by invalid memory reads on the env.
> -- Fix segfault on job arrays when starting controller without dbd up.
> -- Fix pmi2 to build with gcc 8.0+.
> -- Fix proper alignment of clauses when determining if more nodes are needed
> for an allocation.
> -- Fix race condition when canceling a federation job that just started
> running.
> -- Prevent extra resources from being allocated when combining certain flags.
> -- Fix problem in task/affinity plugin that can lead to slurmd fatal()'ing
> when using --hint=nomultithread.
> -- Fix left over socket file when step is ending and using pmi2 with
> %n or %h in the spool dir.
> -- Fix incorrect spacing for PartitionName lines in 'scontrol write config'.
> -- Fix sacct to not print huge reserve times when the job was never eligible.
> -- burst_buffer/cray - Add missing locks around assoc_mgr when timing out a
> burst buffer.
> -- burst_buffer/cray - Update burst buffers when an association or qos
> is removed from the system.
> -- If failed over to a backup controller, ensure the agent thread is launched
> to handle deferred tasks.
> -- Fix correct job CPU count allocated.
> -- Protect against sending to the slurmdbd if the connection has gone away.
> -- Fix checking missing return codes when unpacking structures.
> -- Fix slurm.spec-legacy including slurmsmwd
> -- More explicit error message when cgroup oom-kill events detected.
> -- When updating an association and are unable to find parent association
> initialize old fairshare association pointer correctly.
> -- Wrap slurm_cond_signal() calls with mutexes where needed.
> -- Fix correct timeout with resends in slurm_send_only_node_msg.
> -- Fix pam_slurm_adopt to honor action_adopt_failure.
> -- job_submit/lua - expose argc/argv options through lua interface.