[slurm-announce] Slurm version 17.11.3 available

Tue Feb 6 16:15:07 MST 2018

We are pleased to announce the availability of Slurm version 17.11.3.

This release includes over 44 fixes made since 17.11.2 was released last
month, including a fix for an issue that can result in stray processes when
a job is canceled during a long-running prolog script.

Slurm can be downloaded from https://www.schedmd.com/downloads.php

- Tim

> * Changes in Slurm 17.11.3
> ==========================
>  -- Send SIG_UME correctly to a step.
>  -- Sort sreport's reservation report by cluster, time_start, resv_name instead
>     of cluster, resv_name, time_start.
>  -- Avoid setting node in COMPLETING state indefinitely if the job initiating
>     the node reboot is cancelled while the reboot is in progress.
>  -- Scheduling fix for changing node features without any NodeFeatures plugins.
>  -- Improve logic when summarizing job arrays mail notifications.
>  -- Add scontrol -F/--future option to display nodes in FUTURE state.
>  -- Fix REASONABLE_BUF_SIZE to actually be 3/4 of MAX_BUF_SIZE.
>  -- When a job array is preempting make it so tasks in the array don't wait
>     to preempt other possible jobs.
>  -- Change free_buffer to FREE_NULL_BUFFER to prevent possible double free
>     in slurmstepd.
>  -- node_feature/knl_cray - Fix memory leaks that occur when slurmctld is
>     reconfigured.
>  -- node_feature/knl_cray - Fix memory leak that can occur during normal
>     operation.
>  -- Fix srun environment variables for --prolog script.
>  -- Fix job array dependency with "aftercorr" option when some array tasks in
>     the first job fail. This fix lets all task array elements that can run
>     proceed rather than stopping all subsequent task array elements.
>  -- Fix potential deadlock in the slurmctld when using list_for_each.
>  -- Fix for possible memory corruption in srun when running heterogeneous job
>     steps.
>  -- Fix output file containing "%t" (task ID) for heterogeneous job step to
>     be based upon global task ID rather than task ID for that component of the
>     heterogeneous job step.
>  -- MYSQL - Fix potential abort when attempting to make an account a parent of
>     itself.
>  -- Fix potentially uninitialized variable in slurmctld.
>  -- MYSQL - Fix issue for multi-dimensional machines when using sacct to
>     find jobs that ran on specific nodes.
>  -- Reject --acctg-freq at submit if invalid.
>  -- Added info string on sh5util when deleting an empty file.
>  -- Correct dragonfly topology support when job allocation specifies desired
>     switch count.
>  -- Fix minor memory leak on an sbcast error path.
>  -- Fix issues when starting the backup slurmdbd.
>  -- Revert uid check when requesting a jobid from a pid.
>  -- task/cgroup - add support to detect OOM_KILL cgroup events.
>  -- Fix whole node allocation cpu counts when --hint=nomultithread.
>  -- Allow execution of task prolog/epilog when uid has access
>     rights by a secondary group id.
>  -- Validate command existence on the srun *[pro|epi]log options
>     if LaunchParameters=test_exec is set.
>  -- Fix potential memory leak if clean starting and the TRES didn't change
>     from when last started.
>  -- Fix for association MaxWall enforcement when none is given at submission.
>  -- Add a job's allocated licenses to the [Pro|Epi]logSlurmctld.
>  -- burst_buffer/cray: Attempts by job to create persistent burst buffer when
>     one already exists owned by a different user will be logged and the job
>     held.
>  -- CRAY - Remove race in core_spec when adding the slurmstepd to the job
>     container: if the step was canceled, the stepd would erroneously be
>     canceled as well.
>  -- Make sure the slurmstepd blocks signals like SIGTERM correctly.
>  -- SPANK - When slurm_spank_init_post_opt() fails return error correctly.
>  -- When revoking a sibling job in the federation we want to send a start
>     message before purging the job record to get the uid of the revoked job.
>  -- Make JobAcctGatherParams options case-insensitive. Previously, UsePss
>     was the only correct capitalization; UsePSS or usepss were silently
>     ignored.
>  -- Prevent pthread_atfork handlers from being added unnecessarily after
>     'scontrol reconfigure', which can eventually lead to a crash if too
>     many handlers have been registered.
>  -- Better debug messages when MaxSubmitJobs is hit.
>  -- Docs - update squeue man page to describe all possible job states.
>  -- Preserve node features, including active and available KNL features,
>     when the slurmctld daemon is reconfigured.
>  -- Prevent orphaned step_extern steps when a job is cancelled while the
>     prolog is still running.
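
As a rough illustration of the "aftercorr" entry in the list above: with this dependency type, each task of the dependent job array waits on the task with the same ID in the first array. The Python sketch below is a toy model of the corrected behavior, not Slurm code; the function name and the dictionary layout are hypothetical and chosen only for this example.

```python
# Toy model only; this is not how Slurm implements dependencies.
def aftercorr_runnable(first_job_results):
    """Return the task IDs of the dependent array that may start.

    first_job_results maps task ID -> True if that task of the first
    job array completed successfully, False if it failed.
    """
    # Each dependent task is judged only by its corresponding task,
    # so one failure does not block the remaining elements.
    return sorted(tid for tid, ok in first_job_results.items() if ok)


# Tasks 1 and 3 of the first array fail; tasks 0, 2 and 4 succeed.
results = {0: True, 1: False, 2: True, 3: False, 4: True}
runnable = aftercorr_runnable(results)
print(runnable)
```

In this model, the failures of tasks 1 and 3 hold back only the dependent tasks 1 and 3, while the other elements proceed.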