[slurm-users] Slurm version 20.11.5 is now available

Tim Wickberg tim at schedmd.com
Tue Mar 16 22:16:56 UTC 2021


We are pleased to announce the availability of Slurm version 20.11.5.

This includes a number of moderate-severity bug fixes, alongside a new 
job_container/tmpfs plugin developed by NERSC that can be used to create 
per-job filesystem namespaces.

Initial documentation for this plugin is available at:
https://slurm.schedmd.com/job_container.conf.html
Slurm can be downloaded from https://www.schedmd.com/downloads.php .
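
As a minimal sketch of how the new plugin is enabled (the BasePath 
below is illustrative; see the documentation above for the 
authoritative option list):

    # slurm.conf (excerpt) - the plugin builds the namespace from the
    # prolog, so PrologFlags=contain is required alongside it
    JobContainerType=job_container/tmpfs
    PrologFlags=contain

    # job_container.conf - per-job /tmp directories are created under
    # this path on each compute node
    BasePath=/var/nvme/storage

Once active, each job sees a private /tmp and /dev/shm that are 
cleaned up when the job completes.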

- Tim

-- 
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

> * Changes in Slurm 20.11.5
> ==========================
>  -- Fix main scheduler bug where bf_hetjob_prio truncates
>     SchedulerParameters.
>  -- Fix sacct not displaying UserCPU, SystemCPU and TotalCPU for large times.
>  -- scrontab - fix to return the correct index for a bad #SCRON option.
>  -- scrontab - fix memory leak when invalid option found in #SCRON line.
>  -- Add errno for when a user requests multiple partitions and they are using
>     partition based associations.
>  -- Fix issue where a job could run in a wrong partition when using
>     EnforcePartLimits=any and partition based associations.
>  -- Remove possible deadlock when adding associations/wckeys in multiple
>     threads.
>  -- When using PrologFlags=alloc make sure the correct Slurm version is set
>     in the credential.
>  -- When sending a job a warning signal make sure we always send SIGCONT
>     beforehand.
>  -- Fix issue where a batch job would continue running if a prolog failed on a
>     node that wasn't the batch host and requeuing was disabled.
>  -- Fix issue where sometimes salloc/srun wouldn't get a message about a prolog
>     failure in the job's stdout.
>  -- Requeue or kill job on a prolog failure when PrologFlags is not set.
>  -- Fix race condition causing node reboots to get requeued before
>     ResumeTimeout expires.
>  -- Preserve node boot_req_time on reconfigure.
>  -- Preserve node power_save_req_time on reconfigure.
>  -- Fix node reboots being queued and issued multiple times and preventing the
>     reboot from timing out.
>  -- Fix debug message related to GrpTRESRunMin (AssocGrpCPURunMinutesLimit).
>  -- Fix run_command to exit correctly if track_script kills the calling thread.
>  -- Only requeue a job when the PrologSlurmctld returns nonzero.
>  -- When a job is signaled with SIGKILL make sure we flush all
>     prologs/setup scripts.
>  -- Handle burst buffer scripts if the job is canceled while stage_in is
>     happening.
>  -- When shutting down the slurmctld make note to ignore error message when
>     we have to kill a prolog/setup script we are tracking.
>  -- scrontab - add support for the --open-mode option (example below).
>  -- acct_gather_profile/influxdb - avoid segfault on plugin shutdown if setup
>     has not completed successfully.
>  -- Reduce delay in starting salloc allocations when running with prologs.
>  -- Fix issue passing open fd's with [send|recv]msg.
>  -- Alter AllocNodes check to work if the allocating node's domain doesn't
>     match the slurmctld's. This restores the pre-20.11 behavior.
>  -- Fix slurmctld segfault if jobs from a prior version had the now-removed
>     INVALID_DEPEND state flag set and were allowed to run in 20.11.
>  -- Add job_container/tmpfs plugin to give a method to provide a private /tmp
>     per job.
>  -- Set the correct core affinity when using AutoDetect.
>  -- Start relying on the conf again in xcpuinfo_mac_to_abs().
>  -- Fix global_last_rollup assignment on job resizing.
>  -- slurmrestd - hand over connection context on _on_message_complete().
>  -- slurmrestd - mark "environment" as required for job submissions in schema.
>  -- slurmrestd - Disable credential reuse on the same TCP connection. Pipelined
>     HTTP connections will have to provide authentication with every request
>     (see the curl sketch below).
>  -- Avoid data conversion error on NULL strings in data_get_string_converted().
>  -- Handle situation where slurmctld is too slow processing
>     REQUEST_COMPLETE_BATCH_SCRIPT and it gets resent from the slurmstepd.


