Slurm version 24.05.4 is now available and includes a fix for a recently
discovered security issue with the new stepmgr subsystem.

SchedMD customers were informed on October 9th and provided a patch on
request; this process is documented in our security policy. [1]

A mistake in authentication handling in stepmgr could permit an attacker
to execute processes under other users' jobs. This is limited to jobs
explicitly running with --stepmgr, or on systems that have globally
enabled stepmgr through "SlurmctldParameters=enable_stepmgr" in their
configuration. This issue is tracked as CVE-2024-48936.
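
As a rough aid for checking exposure, the sketch below scans the text of a slurm.conf for a SlurmctldParameters line that lists enable_stepmgr. The function name and the parsing approach are illustrative assumptions, not part of Slurm. It does not follow Include directives, and it cannot detect individual jobs submitted with --stepmgr.

```python
def stepmgr_enabled_globally(conf_text: str) -> bool:
    """Return True if a SlurmctldParameters line lists enable_stepmgr.

    Hypothetical helper: a simple line-based scan of slurm.conf text,
    assuming '#' starts a comment and options are comma-separated.
    """
    for raw_line in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Compare the parameter name case-insensitively.
        if key.strip().lower() != "slurmctldparameters":
            continue
        options = [opt.strip().lower() for opt in value.split(",")]
        if "enable_stepmgr" in options:
            return True
    return False


if __name__ == "__main__":
    sample = (
        "ClusterName=example\n"
        "SlurmctldParameters=enable_configless,enable_stepmgr\n"
    )
    print(stepmgr_enabled_globally(sample))
```

In practice the text would come from reading the site's own slurm.conf. A False result only means the global option was not found in that one file.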

Downloads are available at https://www.schedmd.com/downloads.php .

Release notes follow below.

- Tim

[1] https://www.schedmd.com/security-policy/

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support
> * Changes in Slurm 24.05.4
> ==========================
> -- Fix generic int sort functions.
> -- Fix user look up using possible unrealized uid in the dbd.
> -- Fix FreeBSD compile issue with tls/none plugin.
> -- slurmrestd - Fix regressions that allowed slurmrestd to be run as SlurmUser
> when SlurmUser was not root.
> -- mpi/pmix - Fix race conditions with het jobs at step start/end which could
> cause srun to hang.
> -- Fix not showing some SelectTypeParameters in scontrol show config.
> -- Avoid assert when dumping certain removed fields in JSON/YAML.
> -- Improve how shards are scheduled with affinity in mind.
> -- Fix MaxJobsAccruePU not being respected when MaxJobsAccruePA is set
> in the same QOS.
> -- Prevent backfill from planning jobs that use overlapping resources for the
> same time slot if the job's time limit is less than bf_resolution.
> -- Fix memory leak when requesting typed gres and --[cpus|mem]-per-gpu.
> -- Prevent backfill from breaking out due to "system state changed" every 30
> seconds if reservations use REPLACE or REPLACE_DOWN flags.
> -- slurmrestd - Make sure that scheduler_unset parameter defaults to true even
> when the following flags are also set: show_duplicates, skip_steps,
> disable_truncate_usage_time, run_away_jobs, whole_hetjob,
> disable_whole_hetjob, disable_wait_for_result, usage_time_as_submit_time,
> show_batch_script, and/or show_job_environment. Additionally, always make
> sure show_duplicates and disable_truncate_usage_time default to true when
> the following flags are also set: scheduler_unset, scheduled_on_submit,
> scheduled_by_main, scheduled_by_backfill, and/or job_started. This affects
> the following endpoints:
> 'GET /slurmdb/v0.0.40/jobs'
> 'GET /slurmdb/v0.0.41/jobs'
> -- Ignore --json and --yaml options for scontrol show config to prevent mixing
> output types.
> -- Fix not considering nodes in reservations with Maintenance or Overlap flags
> when creating new reservations with nodecnt or when they replace down nodes.
> -- Fix suspending/resuming steps running under a 23.02 slurmstepd process.
> -- Fix options like sprio --me and squeue --me for users with a uid greater
> than 2147483647.
> -- fatal() if BlockSizes=0. This value is invalid and would otherwise cause the
> slurmctld to crash.
> -- sacctmgr - Fix issue where clearing out a preemption list using
> preempt='' would cause the given qos to no longer be preempt-able until set
> again.
> -- Fix stepmgr creating job steps concurrently.
> -- data_parser/v0.0.40 - Avoid dumping "Infinity" for NO_VAL tagged "number"
> fields.
> -- data_parser/v0.0.41 - Avoid dumping "Infinity" for NO_VAL tagged "number"
> fields.
> -- slurmctld - Fix a potential leak while updating a reservation.
> -- slurmctld - Fix state save with reservation flags when an update fails.
> -- Fix reservation update issues with parameters Accounts and Users, when
> using +/- signs.
> -- slurmrestd - Don't dump warning on empty wckeys in:
> 'GET /slurmdb/v0.0.40/config'
> 'GET /slurmdb/v0.0.41/config'
> -- Fix slurmd possibly leaving zombie processes on start up in configless when
> the initial attempt to fetch the config fails.
> -- Fix crash when trying to drain a non-existing node (possibly deleted
> before).
> -- slurmctld - fix segfault when calculating limit decay for jobs with an
> invalid association.
> -- Fix IPMI energy gathering with multiple sensors.
> -- data_parser/v0.0.39 - Remove xassert requiring errors and warnings to have a
> source string.
> -- slurmrestd - Prevent potential segfault when there is an error parsing an
> array field which could lead to a double xfree. This applies to several
> endpoints in data_parser v0.0.39, v0.0.40 and v0.0.41.
> -- scancel - Fix a regression from 23.11.6 where using both the --ctld and
> --sibling options would cancel the federated job on all clusters instead of
> only the cluster(s) specified by --sibling.
> -- accounting_storage/mysql - Fix bug when removing an association
> specified with an empty partition.
> -- Fix setting multiple partition state restore on a job correctly.
> -- Fix difference in behavior when swapping partition order in job submission.
> -- Fix security issue in stepmgr that could permit an attacker to execute
> processes under other users' jobs. CVE-2024-48936.
Available presentations from this year's SLUG event are now online.
They can be found at https://www.schedmd.com/publications/
We thank all those who presented and attended for a great event!
--
Victoria Hobson
SchedMD LLC
Vice President of Marketing