We are pleased to announce the availability of Slurm version 24.05.1.
This release addresses a number of minor-to-moderate issues since the 24.05 release was first announced a month ago.
Slurm can be downloaded from https://www.schedmd.com/downloads.php .
- Tim
- Changes in Slurm 24.05.1
==========================
 -- Fix slurmctld and slurmdbd potentially stopping instead of performing a
    logrotate when receiving SIGUSR2 when using auth/slurm.
 -- switch/hpe_slingshot - Fix slurmctld crash when upgrading from 23.02.
 -- Fix "Could not find group" errors from validate_group() when using
    AllowGroups with large /etc/group files.
 -- Prevent an assertion in debugging builds when triggering log rotation in a
    backup slurmctld.
 -- Add AccountingStoreFlags=no_stdio which, when set, prevents the stdio paths
    of the job from being recorded.
 -- slurmrestd - Prevent a slurmrestd segfault when parsing the crontab field,
    which was never usable. Now it explicitly ignores the value and emits a
    warning if it is used for the following endpoints:
      'POST /slurm/v0.0.39/job/{job_id}'
      'POST /slurm/v0.0.39/job/submit'
      'POST /slurm/v0.0.40/job/{job_id}'
      'POST /slurm/v0.0.40/job/submit'
      'POST /slurm/v0.0.41/job/{job_id}'
      'POST /slurm/v0.0.41/job/submit'
      'POST /slurm/v0.0.41/job/allocate'
 -- mpi/pmi2 - Fix communication issue leading to task launch failure with
    "invalid kvs seq from node".
 -- Fix getting user environment when using sbatch with "--get-user-env" or
    "--export=" when there is a user profile script that reads /proc.
 -- Prevent slurmd from crashing if acct_gather_energy/gpu is configured but
    GresTypes is not configured.
 -- Do not log the following errors when AcctGatherEnergyType plugins are used
    but a node does not have or cannot find sensors:
      "error: _get_joules_task: can't get info from slurmd"
      "error: slurm_get_node_energy: Zero Bytes were transmitted or received"
    However, the following error will continue to be logged:
      "error: Can't get energy data. No power sensors are available. Try later"
 -- sbatch, srun - Set SLURM_NETWORK environment variable if --network is set.
 -- Fix cloud nodes not being able to forward to nodes that restarted with new
    IP addresses.
 -- Fix cwd not being set correctly when running a SPANK plugin with a
    spank_user_init() hook and the new "contain_spank" option set.
 -- slurmctld - Avoid deadlock during shutdown when auth/slurm is active.
 -- Fix segfault in slurmctld with topology/block.
 -- sacct - Fix printing of job group for job steps.
 -- scrun - Log when an invalid environment variable causes the job submission
    to be rejected.
 -- accounting_storage/mysql - Fix problem where listing or modifying an
    association when specifying a qos list could hang or take a very long time.
 -- gpu/nvml - Fix gpuutil/gpumem only tracking last GPU in step. Now,
    gpuutil/gpumem will record sums of all GPUs in the step.
 -- Fix error in scrontab jobs when using slurm.conf:PropagatePrioProcess=1.
 -- Fix slurmctld crash on a batch job submission with "--nodes 0,...".
 -- Fix dynamic IP address fanout forwarding when using auth/slurm.
 -- Restrict listening sockets in the mpi/pmix plugin and sattach to the
    SrunPortRange.
 -- slurmrestd - Limit mime types returned from query to 'GET /openapi/v3' to
    only return one mime type per serializer plugin to fix issues with OpenAPI
    client generators that are unable to handle multiple mime type aliases.
 -- Fix many commands possibly reporting an "Unexpected Message Received" when
    in reality the connection timed out.
 -- Prevent slurmctld from starting if there is not a json serializer present
    and the extra_constraints feature is enabled.
 -- Fix heterogeneous job components not being signaled with scancel --ctld and
    'DELETE slurm/v0.0.40/jobs' if the job ids are not explicitly given, the
    heterogeneous job components match the given filters, and the heterogeneous
    job leader does not match the given filters.
 -- Fix regression from 23.02 impeding job licenses from being cleared.
 -- Move the _get_joules_task error to a log_flag. Previously it was logged to
    the user when too many RPCs were queued in slurmd for gathering energy.
 -- For scancel --ctld and the associated REST API endpoints:
      'DELETE /slurm/v0.0.40/jobs'
      'DELETE /slurm/v0.0.41/jobs'
    Fix canceling the final array task in a job array when the task is pending
    and all array tasks have been split into separate job records. Previously
    this task was not canceled.
 -- Fix power_save operation after recovering from a failed reconfigure.
 -- slurmctld - Skip removing the pidfile when running under systemd. In that
    situation it is never created in the first place.
 -- Fix issue where altering the flags on a Slurm account (UsersAreCoords)
    caused several limits on the account's association to be set to 0 in
    Slurm's internal cache.
 -- Fix memory leak in the controller when relaying stepmgr step accounting to
    the dbd.
 -- Fix segfault when submitting stepmgr jobs within an existing allocation.
 -- Added "disable_slurm_hydra_bootstrap" as a possible MpiParams parameter in
    slurm.conf. Using this will disable env variable injection to allocations
    for the following variables: I_MPI_HYDRA_BOOTSTRAP,
    I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS, HYDRA_BOOTSTRAP,
    HYDRA_LAUNCHER_EXTRA_ARGS.
 -- scrun - Delay shutdown until after start is requested. Previously, scrun
    could fail to start or shut down and hang forever when using --tty.
 -- Fix backup slurmctld potentially not running the agent when taking over as
    the primary controller.
 -- Fix primary controller not running the agent when a reconfigure of the
    slurmctld fails.
 -- slurmd - Fix premature timeout waiting for REQUEST_LAUNCH_PROLOG with large
    array jobs causing node to drain.
 -- jobcomp/{elasticsearch,kafka} - Avoid sending fields with invalid date/time.
 -- jobcomp/elasticsearch - Fix slurmctld memory leak from curl usage.
 -- acct_gather_profile/influxdb - Fix slurmstepd memory leak from curl usage.
 -- Fix 24.05.0 regression not deleting job hash dirs after MinJobAge.
 -- Fix filtering arguments being ignored when using squeue --json.
 -- switch/nvidia_imex - Move setup call after spank_init() to allow namespace
    manipulation within the SPANK plugin.
 -- switch/nvidia_imex - Skip plugin operation if nvidia-caps-imex-channels
    device is not present rather than preventing slurmd from starting.
 -- switch/nvidia_imex - Skip plugin operation if job_container/tmpfs is
    configured due to incompatibility.
 -- switch/nvidia_imex - Remove any pre-existing channels when slurmd starts.
 -- rpc_queue - Add support for an optional rpc_queue.yaml configuration file.
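
For sites that want to try the two new slurm.conf settings mentioned in the
changes above, a minimal sketch of how they might appear follows. The option
names and values are taken from the changelog. Showing each as the only value
on its line is an assumption made for illustration: both options take
comma-separated lists, so any flags or parameters a site already sets would
need to be merged into the same line rather than replaced.

```
# Illustrative slurm.conf fragment (assumes no other values are set for
# these two options; merge with existing comma-separated values if they are).

# Do not record the stdio paths of jobs in accounting.
AccountingStoreFlags=no_stdio

# Disable injection of I_MPI_HYDRA_BOOTSTRAP,
# I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS, HYDRA_BOOTSTRAP and
# HYDRA_LAUNCHER_EXTRA_ARGS into allocations.
MpiParams=disable_slurm_hydra_bootstrap
```

The slurm.conf man page for 24.05 is the place to confirm the exact semantics
of each option before deploying them.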