From tim at schedmd.com Wed Jan 30 19:11:27 2019 From: tim at schedmd.com (Tim Wickberg) Date: Wed, 30 Jan 2019 12:11:27 -0700 Subject: [slurm-announce] Slurm versions 17.11.13 and 18.08.5 are now available (CVE-2019-6438) Message-ID: <5b9f383f-3b8b-6840-089d-31ba48668565@schedmd.com> Slurm versions 17.11.13 and 18.08.5 are now available, and include a series of recent bug fixes, as well as a fix for a security vulnerability (CVE-2019-6438) on 32-bit systems. We believe that 64-bit builds - the overwhelming majority of installations - of Slurm are not affected by this issue. Downloads are available at https://www.schedmd.com/downloads.php . While fixes are only available for the supported 17.11 and 18.08 releases, similar vulnerabilities affect 32-bit builds on past versions as well. The only resolution is to upgrade Slurm to a fixed release. SchedMD customers were informed on January 16th and provided a patch on request; this process is documented in our security policy [1]. Release notes follow below. - Tim [1] https://www.schedmd.com/security.php -- Tim Wickberg Chief Technology Officer, SchedMD LLC Commercial Slurm Development and Support From tim at schedmd.com Wed Jan 30 19:16:41 2019 From: tim at schedmd.com (Tim Wickberg) Date: Wed, 30 Jan 2019 12:16:41 -0700 Subject: [slurm-announce] Slurm versions 17.11.13 and 18.08.5 are now available (CVE-2019-6438) In-Reply-To: <5b9f383f-3b8b-6840-089d-31ba48668565@schedmd.com> References: <5b9f383f-3b8b-6840-089d-31ba48668565@schedmd.com> Message-ID: Forgot to attach the release notes, they are included below for reference: > * Changes in Slurm 18.08.5 > ========================== > -- Backfill - If a job has a time_limit guess the end time of a job better > if OverTimeLimit is Unlimited. > -- Fix "sacctmgr show events event=cluster" > -- Fix sacctmgr show runawayjobs from sibling cluster > -- Avoid bit offset of -1 in call to bit_nclear(). > -- Insure that "hbm" is a configured GresType on knl systems. 
> -- Fix NodeFeaturesPlugins=node_features/knl_generic to allow other gres > other than knl. > -- cons_res: Prevent overflow on multiply. > -- Better debug for bad values in gres.conf. > -- Fix double accounting of energy at end of job. > -- Read gres.conf for cloud nodes on slurmctld. > -- Don't assume the first node of a job is the batch host when purging jobs > from a node. > -- Better debugging when a job doesn't have a job_resrcs ptr. > -- Store ave watts in energy plugins. > -- Add XCC plugin for reading Lenovo Power. > -- Fix minor memory leak when scheduling rebootable nodes. > -- Fix debug2 prefix for sched log. > -- Fix printing correct SLURM_JOB_ACCOUNT_PACK_GROUP_* in env for a Het Job. > -- sbatch - search current working directory first for job script. > -- Make it so held jobs reset the AccrueTime and do not count against any > AccrueTime limits. > -- Add SchedulerParameters option of bf_hetjob_prio=[min|avg|max] to alter the > job sorting algorithm for scheduling heterogeneous jobs. > -- Fix initialization of assoc_mgr_locks and slurmctld_locks lock structures. > -- Fix segfault with job arrays using X11 forwarding. > -- Revert regression caused by e0ee1c7054 which caused negative values and > values starting with a decimal to be invalid for PriorityWeightTRES and > TRESBillingWeight. > -- Fix possibility to update a job's reservation to none. > -- Suppress connection errors to primary slurmdbd when backup dbd is active. > -- Suppress connection errors to primary db when backup db kicks in > -- Add missing fields for sacct --completion when using jobcomp/filetxt. > -- Fix incorrect values set for UserCPU, SystemCPU, and TotalCPU sacct fields > when JobAcctGatherType=jobacct_gather/cgroup. > -- Fixed srun from double printing invalid option msg twice. > -- Remove unused -b flag from getopt call in sbatch. > -- Disable reporting of node TRES in sreport. > -- Re-enabling features combined by OR within parenthesis for non-knl setups. 
> -- Prevent sending duplicate requests to reboot a node before ResumeTimeout. > -- Down nodes that don't reboot by ResumeTimeout. > -- Update seff to reflect API change from rss_max to tres_usage_in_max. > -- Add missing TRES constants from perl API. > -- Fix issue where sacct would return incorrect array tasks when querying > specific tasks. > -- Add missing variables to slurmdb_stats_t in the perlapi. > -- Fix nodes not getting reboot RPC when job requires reboot of nodes. > -- Fix failing update the partition list of a job. > -- Use slurm.conf gres ids instead of gres.conf names to get a gres type name. > -- Add mitigation for a potential heap overflow on 32-bit systems in xmalloc. > CVE-2019-6438. > * Changes in Slurm 17.11.13 > =========================== > -- Add mitigation for a potential heap overflow on 32-bit systems in xmalloc. > CVE-2019-6438. From tim at schedmd.com Thu Mar 7 21:39:56 2019 From: tim at schedmd.com (Tim Wickberg) Date: Thu, 7 Mar 2019 14:39:56 -0700 Subject: [slurm-announce] Slurm versions 18.08.6 is now available, as well as 19.05.0pre2, and Slurm on GCP update Message-ID: <7055ea52-07bf-364a-aaf3-5ac63aa40eb3@schedmd.com> We are pleased to announce the availability of Slurm version 18.08.6, as well as the second 19.05 release preview version 19.05.0pre2. The 18.08.6 release includes over 50 fixes since the last maintenance release was made five weeks ago. The second preview of the 19.05 release - 19.05.0pre2 - is meant to highlight additional functionality coming with the new select/cons_tres plugin, alongside other recent development work. Please consult the RELEASE_NOTES file for a detailed list of changes made to date. Please note that preview releases are meant for testing and development only, and should not be used in production, are not supported, and that you cannot migrate to a newer release from these without potential loss of data and your job queues.
I'd also like to call attention to some of our recent work in partnership with Google. There's a blog post today highlighting some of this recent work both on Slurm and with the slurm-gcp integration scripts (https://github.com/SchedMD/slurm-gcp): https://cloud.google.com/blog/products/compute/hpc-made-easy-announcing-new-features-for-slurm-on-gcp Slurm can be downloaded from https://www.schedmd.com/downloads.php . - Tim -- Tim Wickberg Chief Technology Officer, SchedMD LLC Commercial Slurm Development and Support > * Changes in Slurm 18.08.6 > ========================== > -- Added parsing of -H flag with scancel. > -- Fix slurmsmwd build on 32-bit systems. > -- acct_gather_filesystem/lustre - add support for Lustre 2.12 client. > -- Fix per-partition TRES factors/priority > -- Fix per-partition NICE priority > -- Fix partition access check validation for multi-partition job submissions. > -- Prevent segfault on empty response in 'scontrol show dwstat'. > -- node_features/knl_cray plugin - Preserve node's active features if it has > already booted when slurmctld daemon is reconfigured. > -- Detect missing burst buffer script and reject job. > -- GRES: Properly reset the topo_gres_cnt_alloc counter on slurmctld restart > to prevent underflow. > -- Avoid errors from packing accounting_storage_mysql.so when RPM is built > with out mysql support. > -- Remove deprecated -t option from slurmctld --help. > -- acct_gather_filesystem/lustre - fix stats gathering. > -- Enforce documented default usage start and end times when querying jobs from > the database. > -- Fix issues when querying running jobs from the database. > -- Deny sacct request where start time is later than the end time requested. > -- Fix sacct verbose about time and states queried. > -- burst_buffer/cray - allow 'scancel --hurry ' to tear down a burst > buffer that is currently staging data out. > -- X11 forwarding - allow setup if the DISPLAY environment variable lacks > a screen number. 
(Permit both "localhost:10.0" and "localhost:10".) > -- docs - change HTML title to include the page title or man page name. > -- X11 forwarding - fix an unnecessary error message when using the > local_xauthority X11Parameters option. > -- Add use_raw_hostname to X11Parameters. > -- Fix smail so it passes job arrays to seff correctly. > -- Don't check InactiveLimit for salloc --no-shell jobs. > -- Add SALLOC_GRES and SBATCH_GRES as input to salloc/sbatch. > -- Remove drain state when node doesn't reboot by ResumeTimeout. > -- Fix considering "resuming" nodes in scheduling. > -- Do not kill suspended jobs due to exceeding time limit. > -- Add NoAddrCache CommunicationParameter. > -- Don't ping powering up cloud nodes. > -- Add cloud_dns SlurmctldParameter. > -- Consider --sbindir configure option as the default path to find slurmstepd. > -- Fix node state printing of DRAINED$ > -- Fix spamming dbd of down/drained nodes in maintenance reservation. > -- Avoid buffer overflow in time_str2secs. > -- Calculate suspended time for suspended steps. > -- Add null check for step_ptr->step_node_bitmap in _pick_step_nodes. > -- Fix multi-cluster srun issue after 'scontrol reconfigure' was called. > -- Fix accessing response_cluster_rec outside of write locks. > -- Fix Lua user messages not showing up on rejected submissions. > -- Fix printing multi-line error messages on rejected submissions. From tim at schedmd.com Fri Apr 12 04:27:03 2019 From: tim at schedmd.com (Tim Wickberg) Date: Fri, 12 Apr 2019 12:27:03 +0800 Subject: [slurm-announce] Slurm version 18.08.7 is now available Message-ID: We are pleased to announce the availability of Slurm version 18.08.7. This includes over 20 fixes since 18.08.6 was released last month, including one for a regression that caused issues with 'sacct -J' not returning results correctly. Slurm can be downloaded from https://www.schedmd.com/downloads.php .
- Tim -- Tim Wickberg Chief Technology Officer, SchedMD LLC Commercial Slurm Development and Support > * Changes in Slurm 18.08.7 > ========================== > -- Set debug statement to debug2 to avoid benign error messages. > -- Add SchedulerParameters option of bf_hetjob_immediate to attempt to start > a heterogeneous job as soon as all of its components are determined able to > do so. > -- Fix underflow causing decay thread to exit. > -- Fix main scheduler not considering hetjobs when building the job queue. > -- Fix regression for sacct to display old jobs without a start time. > -- Fix setting correct number of gres topology bits. > -- Update hetjobs pending state reason when appropriate. > -- Fix accounting_storage/filetxt's understanding of TRES. > -- Set Accrue time when not enforcing limits. > -- Fix srun segfault when requesting a hetjob with test_exec or bcast options. > -- Hide multipart priorities log message behind Priority debug flag. > -- sched/backfill - Make hetjobs sensitive to bf_max_job_start. > -- Fix slurmctld segfault due to job's partition pointer NULL dereference. > -- Fix issue with OR'ed job dependencies. > -- Add new job's bit_flags of INVALID_DEPEND to prevent rebuilding a job's > dependency string when it has at least one invalid and purged dependency. > -- Promote federation unsynced siblings log message from debug to info. > -- burst_buffer/cray - fix slurmctld SIGABRT due to illegal read/writes. > -- burst_buffer/cray - fix memory leak due to unfreed job script content. > -- node_features/knl_cray - fix script_argv use-after-free. > -- burst_buffer/cray - fix script_argv use-after-free. > -- Fix invalid reads of size 1 due to non null-terminated string reads. > -- Add extra debug2 logs to identify why BadConstraints reason is set. 
From tim at schedmd.com Tue Apr 30 22:15:11 2019 From: tim at schedmd.com (Tim Wickberg) Date: Tue, 30 Apr 2019 16:15:11 -0600 Subject: [slurm-announce] Slurm release candidate version 19.05.0rc1 available for testing Message-ID: <9e94498b-a82b-1e44-c47f-66f386f3236e@schedmd.com> We are pleased to announce the availability of Slurm release candidate version 19.05.0rc1. This is the first release candidate version of the upcoming 19.05 release series, and represents the end of development for the release cycle, and a finalization of the RPC and state file formats. If any issues are identified with this new release candidate, please report them through https://bugs.schedmd.com against the 19.05.x version and we will address them before the first production 19.05.0 release is made. Please note that the release candidates are not intended for production use. Barring any late-discovered issues, the state file formats should not change between now and 19.05.0 and are considered frozen at this time for the 19.05 release. A preview of the updated documentation can be found at https://slurm.schedmd.com/archive/slurm-master/ . Slurm can be downloaded from https://www.schedmd.com/downloads.php . - Tim -- Tim Wickberg Chief Technology Officer, SchedMD LLC Commercial Slurm Development and Support From tim at schedmd.com Tue May 28 17:54:00 2019 From: tim at schedmd.com (Tim Wickberg) Date: Tue, 28 May 2019 13:54:00 -0400 Subject: [slurm-announce] Slurm version 19.05.0 is now available Message-ID: After 9 months of development and testing we are pleased to announce the availability of Slurm version 19.05.0! Downloads are available from https://www.schedmd.com/downloads.php. Highlights of the 19.05 release include: - The new select/cons_tres plugin, which introduces new GPU-specific job submission options, and extends Slurm's backfill scheduling logic to cover resources beyond just cpus and memory. 
- A new NSS library - nss_slurm - has been developed, which can provide directory info for the job step's user to local processes. - Heterogeneous Job support on Cray Aries systems. - A new "Association" priority factor, and corresponding PriorityWeightAssoc setting, providing for an alternative approach to establishing relative priority values between groups. - Two new plugin APIs intended for sites to customize their Slurm installations: cli_filter and site_factor. Thank you to all customers, partners, and community members who contributed to getting this release done. As with past releases, the documentation available at https://slurm.schedmd.com has been updated to the 19.05 release. Past versions are available in the archive. This release also marks the end of support for the 17.11 release. The 18.08 release will remain supported up until the 20.02 release in February, but will stop receiving as frequent updates, and bug-fixes will be targeted for the 19.05 maintenance releases going forward. -- Tim Wickberg Chief Technology Officer, SchedMD Commercial Slurm Development and Support From tim at schedmd.com Mon Jun 24 18:34:04 2019 From: tim at schedmd.com (Tim Wickberg) Date: Mon, 24 Jun 2019 12:34:04 -0600 Subject: [slurm-announce] [slurm-users] Call for Abstracts - 2019 Slurm User Group Meeting In-Reply-To: References: Message-ID: This is a combined reminder and extension for the Call for Abstracts for presentations for the 2019 Slurm User Group Meeting. The deadline is now extended by an additional week - abstracts must now be received by July 5th for consideration. As an additional reminder, early registration will end on July 14th, after which time the registration fee will increase to the standard rate. Please contact Jacob with abstract submissions, or any questions. 
- Tim On 05/14/2019 02:02 PM, Jacob Jenson wrote: > You are invited to submit an abstract of a tutorial, technical > presentation or site report to be given at the 2019 Slurm User > Group Meeting. This event is sponsored and organized by the University > of Utah and SchedMD. This international event is open to those who > want to: > > * Learn more about Slurm, a highly scalable Resource Manager and Job > Scheduler > * Share their knowledge and experience with other users and > administrators > * Get detailed information about the latest features and developments > * Share requirements and discuss future developments > > Everyone who wants to present their own usage, developments, site > report, or tutorial about Slurm is invited to send an abstract to > slugc at schedmd.com > > *Important Dates:* > 28 June 2019: Abstracts due > 12 July 2019: Notification of acceptance > > *Slurm User Group Meeting 2019* > 17-18 September 2019 > Salt Lake City Utah On 05/14/2019 02:03 PM, Jacob Jenson wrote: > Registration for the 2019 Slurm User Group Meeting is open. You can register at https://slug19.eventbrite.com/ > > The meeting will be held on 17-18 September 2019 in Salt Lake City at the University of Utah > > Early registration > May 14 through July 14 > $300 USD > Standard registration > July 15 through August 15 > $375 USD > Late registration > August 16 through August 31 > $600 USD > > A block of rooms has been reserved at the University Guest House for those attending. The University Guest House is conveniently located next to the conference meeting room. You can reserve a room at the Guest House by calling +1-888-416-4075 by August 16, 2019. Please mention the group name “Slurm User Group 2019” in order to receive a discounted room rate. Online reservations can be made at https://www.universityguesthouse.com/University-Guest-House > > Please contact me with any questions regarding the 2019 Slurm User Group Meeting.
> > Jacob From tim at schedmd.com Wed Jul 10 19:27:12 2019 From: tim at schedmd.com (Tim Wickberg) Date: Wed, 10 Jul 2019 13:27:12 -0600 Subject: [slurm-announce] Slurm versions 19.05.1 and 18.08.8 are now available (CVE-2019-12838) Message-ID: <70e61882-3c93-0323-c08a-20a5ec3112b3@schedmd.com> Slurm versions 19.05.1 and 18.08.8 are now available, and include a series of recent bug fixes, as well as a fix for a security vulnerability (CVE-2019-12838) related to the 'sacctmgr archive load' functionality. While fixes are only available for the currently supported 19.05 and 18.08 releases, similar vulnerabilities affect past versions as well and sites are encouraged to upgrade to a supported version. SchedMD customers were informed on June 26th and provided a patch on request; this process is documented in our security policy [1]. Downloads are available at https://www.schedmd.com/downloads.php . Release notes follow below. - Tim [1] https://www.schedmd.com/security.php -- Tim Wickberg Chief Technology Officer, SchedMD LLC Commercial Slurm Development and Support > * Changes in Slurm 19.05.1 > ========================== > -- accounting_storage/mysql - fix incorrect function names in error messages. > -- accounting_storage/slurmdbd - trigger an fsync() on the dbd.messages state > file to ensure it is committed to disk properly. > -- Avoid JobHeldUser state reason from being updated at allocation time. > -- Fix dump/load of rejected heterogeneous jobs. > -- For heterogeneous jobs, do not count the each component against the QOS or > association job limit multiple times. > -- Comment out documentation for the incomplete and currently unusable > burst_buffer/generic plugin. > -- Add new error ESLURM_INVALID_TIME_MIN_LIMIT to make note when a time_min > limit is invalid based on timelimit. > -- Correct slurmdb cluster record pack with NULL pointer input. > -- Clearer error message for ESLURM_INVALID_TIME_MIN_LIMIT.
> -- Fix SchedulerParameter bf_min_prio_reserve error when not the last parameter > -- When fixing runaway jobs, change to reroll from earliest submit time, and > never reroll from Unix epoch. > -- Display submit time when running sacctmgr show runawayjobs and add format > option to display eligible time. > -- jobcomp/elasticsearch - fix minor race related to JobCompLoc setup. > -- For HetJobs, ensure SLURM_PACK_JOB_ID is set regardless of whether > PrologFlags=Alloc is enabled. > -- Fix PriorityFlags regression with the mutation of FAIR_TREE to NO_FAIR_TREE. > -- select/cons_res - fix debug flag SelectType handling in select_p_job_test. > -- Fix sacctmgr archive dump commit confirmation. > -- Prevent extra resources from being allocated when combining certain flags. > -- Cray - fix template generator with update cray_aries plugin names. > -- accounting_storage/slurmdbd - provide additional detail in several error > messages. > -- Backfill - If a job has a time_limit guess the end time of a job better > if OverTimeLimit is Unlimited. > -- Remove premature call to get system gpus before querying fake gpus that > should override the real. > -- Fix segfault in epilog_set_env() when gres_devices is NULL. > -- Fix (un)supported states in sacct. > -- Adjust build system to no longer use the AC_FUNC_MALLOC autoconf macro. > -- srun - restore the --cpu_bind option to srun. > -- Add UsageFactorSafe QOS flag to control applying UsageFactor at > submission/scheduling time. > -- Create missing reservations on DBD_MODIFY_RESV. > -- Add error message when attempting to update association manager and object > doesn't exist. > -- Fix security issue in accounting_storage/mysql plugin on archive file loads > by always escaping strings within the slurmdbd. CVE-2019-12838. > * Changes in Slurm 18.08.7 > ========================== > -- Set debug statement to debug2 to avoid benign error messages. 
> -- Add SchedulerParameters option of bf_hetjob_immediate to attempt to start > a heterogeneous job as soon as all of its components are determined able to > do so. > -- Fix underflow causing decay thread to exit. > -- Fix main scheduler not considering hetjobs when building the job queue. > -- Fix regression for sacct to display old jobs without a start time. > -- Fix setting correct number of gres topology bits. > -- Update hetjobs pending state reason when appropriate. > -- Fix accounting_storage/filetxt's understanding of TRES. > -- Set Accrue time when not enforcing limits. > -- Fix srun segfault when requesting a hetjob with test_exec or bcast options. > -- Hide multipart priorities log message behind Priority debug flag. > -- sched/backfill - Make hetjobs sensitive to bf_max_job_start. > -- Fix slurmctld segfault due to job's partition pointer NULL dereference. > -- Fix issue with OR'ed job dependencies. > -- Add new job's bit_flags of INVALID_DEPEND to prevent rebuilding a job's > dependency string when it has at least one invalid and purged dependency. > -- Promote federation unsynced siblings log message from debug to info. > -- burst_buffer/cray - fix slurmctld SIGABRT due to illegal read/writes. > -- burst_buffer/cray - fix memory leak due to unfreed job script content. > -- node_features/knl_cray - fix script_argv use-after-free. > -- burst_buffer/cray - fix script_argv use-after-free. > -- Fix invalid reads of size 1 due to non null-terminated string reads. > -- Add extra debug2 logs to identify why BadConstraints reason is set. From tim at schedmd.com Tue Aug 13 19:11:02 2019 From: tim at schedmd.com (Tim Wickberg) Date: Tue, 13 Aug 2019 13:11:02 -0600 Subject: [slurm-announce] Slurm version 19.05.2 is now available Message-ID: <1c3561d4-a17d-f575-a863-c03daa9a288b@schedmd.com> Slurm version 19.05.2 is now available, and includes a series of minor bug fixes since 19.05.1 was released over a month ago. 
Downloads are available at https://www.schedmd.com/downloads.php . Release notes follow below. - Tim -- Tim Wickberg Chief Technology Officer, SchedMD LLC Commercial Slurm Development and Support > * Changes in Slurm 19.05.2 > ========================== > -- Wrap END_TIMER{,2,3} macro definition in "do {} while (0)" block. > -- Allow account coordinators to add users who don't already have an > association with any account. > -- If only allowing particular alloc nodes in a partition, deny any request > coming from an alloc node of NULL. > -- Prevent partial-load of plugins which can leave certain interfaces in > an inconsistent state. > -- Remove stray __USE_GNU macro definitions from source. > -- Fix loading fed state by backup on subsequent takeovers. > -- Add missing job read lock when loading fed job state. > -- Add missing fed_job_info jobs if fed state is lost. > -- Do not build cgroup plugins on FreeBSD or NetBSD, and use proctrack/pgid > by default instead. > -- Do not build switch/cray_aries plugin on FreeBSD, NetBSD, or macOS. > -- Fix build on FreeBSD. > -- Fix race condition in route/topology plugin. > -- In munge decode set the alloc_node field to the text representation of an > IP address if the reverse lookup fails. > -- Fix infinite loop in slurmstepd handling for nss_slurm REQUEST_GETGR RPC. > -- Fix slurmstepd early assertion fail which prevented batch job launch or > tasks launch on non-Linux systems. > -- Fix regression with SLURM_STEP_GPUS env var being renamed SLURM_STEP_GRES. > -- Fix pmix v3 linking if no rpath is allowed on build. > -- Fix sacctmgr error handling when removing associations and users. > -- Allow sacctmgr to add users to WCKeys without having TrackWCKey set in the > slurm.conf. > -- Allow sacctmgr to delete WCKeys from users. > -- Change GRES type set by gpu/gpu_nvml plugin to be more specific - based > on device name instead of brand name. > -- cli_filter - fix logic error with option lookup functions. 
> -- Fix bad testing of NodeFeatures debug flag in contribs/cray. > -- Cleanup track_script code to avoid race conditions and invalid memory > access. > -- Fix jobs being killed after being requeued by preemption. > -- Make register nodes verify correctly when using cons_tres. > -- Fix srun --mem-per-cpu being ignored. > -- Fix segfault in _update_job() under certain conditions. > -- job_submit/lua - restore slurm.FAILURE as a synonym for slurm.ERROR. From tim at schedmd.com Thu Oct 3 19:43:47 2019 From: tim at schedmd.com (Tim Wickberg) Date: Thu, 3 Oct 2019 13:43:47 -0600 Subject: [slurm-announce] Slurm version 19.05.3 is now available Message-ID: <17337d8e-f87a-ac96-d5d3-7a72ce7f8666@schedmd.com> Slurm version 19.05.3 is now available, and includes a series of fixes since 19.05.2 was released nearly two months ago. Downloads are available at https://www.schedmd.com/downloads.php . Release notes follow below. - Tim -- Tim Wickberg Chief Technology Officer, SchedMD LLC Commercial Slurm Development and Support > * Changes in Slurm 19.05.3 > ========================== > -- Fix missing check from conversion of cray -> cray_aries. > -- Improve job state reason string when required nodes are not available by > not including those that don't belong to the job partition. > -- Set a more appropriate ESLURM_RESERVATION_MAINT job state reason for jobs > requesting feature(s) and required nodes are in a maintenance reservation. > -- Fix logic to better handle maintenance reservations. > -- Add spank options to cache in remote callback. > -- Enforce the use of spank_option_getopt(). > -- Fix select plugins' will run test under-allocating nodes usage for > completing jobs. > -- Nodes in COMPLETING state treated as being currently available for job > will-run test. > -- Cray - fix contribs slurm.conf.j2 with updated cray_aries plugin names. > -- job_submit/lua - fix problem where nil was expected for min_mem_per_cpu. 
> -- Fix extra, unaccounted TRESRunMins usage created by heterogeneous jobs when > running with the priority/multifactor plugin. > -- Detach threads once they are done to avoid having to join them > in track scripts code. > -- Handle situation where a slurmctld tries to communicate with slurmdbd more > than once at the same time. > -- Fix XOR/XAND features like cpu&fastio&[knl|westmere] to be resolved > correctly. > -- Don't update [min|max]_exit_code on job array task requeue. > -- Don't assume the first node of a job is the batch host when testing if the > job's allocated nodes are booted/ready. > -- Make --batch= requests wait for all nodes to be booted so that it > can choose the batch host after the nodes have been booted -- possibly with > different features. > -- Fix talking to batch host on it's protocol version when using --batch. > -- gres/mic plugin - add missing fini() function to clean up plugin state. > -- Move _validate_node_choice() before prolog/epilog check. > -- Look forward one week while create new reservation. > -- Set mising resv_desc.flags before call _select_nodes(). > -- Use correct start_time for TIME_FLOAT reservation in _job_overlap(). > -- Properly enforce a job's mem-per-cpu option when allocate the node > exclusively to that job. > -- sched/backfill - clear estimated sched_nodes as done for start_time. > -- Have safe_[read|write] handle EAGAIN and EINTR. > -- Fix checking for flag with logical AND. > -- Correct "extern" definition of variable if compiling with __APPLE__. > -- Deprecate FastSchedule. FastSchedule will be removed in 20.02. > The FastSchedule=2 functionality (used for testing and development) has > been retained as the new SlurmdParameters=config_overrides option. > -- Fix preemption issue when picking nodes for a feature job request. > -- Fix race condition preventing held array job from getting a db_index. > -- Fix select/cons_tres gres code infinite loop leaving slurmctld unresponsive. 
> -- Remove redefinition of global variable in gres.c > -- Fix issue where GPU devices are denied access when MPS is enabled. > -- Fix uninitialized errors when compiling with CFLAGS="--coverage". > -- Fix scancel --full for proctrack/cgroups. > -- Fix sdiag backfill last and mean queue length stats. > -- Do not remove batch host when resizing/shrinking a batch job. > -- nss_slurm - fix file descriptor leaks. > -- Fix preemption for jobs using complex feature requests > (e.g. -C "[rack1*2&rack2*4]"). > -- Fix memory leaks in preemption when jobs request multiple features. > -- Allow Operator users to show/fix runaways. > -- Disallow coordinators to show/fix runaways. > -- mpi/pmi2 - increase array len to avoid buffer size exceeded error. > -- Preserve rebooting node's nextstate when updating state with scontrol. > -- Fully merge slurm.conf and gres.conf before node_config_load(). > -- Remove FastSchedule dependence from gres.conf's AutoDetect=nvml. > -- Forbid mix of typed and untyped GRES of same name in slurm.conf. > -- cons_tres: Prevent creating a job without CPUs. > -- Prevent underflow when filtering cores with gres. > -- proctrack/cray_aries: use current pid instead of thread if we're in a fork. > -- Fix missing check for prolog launch credential creation failure that can > lead to segfaults From tim at schedmd.com Wed Oct 16 02:21:20 2019 From: tim at schedmd.com (Tim Wickberg) Date: Tue, 15 Oct 2019 20:21:20 -0600 Subject: [slurm-announce] Slurm User Group 2019 (SLUG19) presentations online, SC19 Message-ID: <136c282d-145c-4791-bf0d-60cba58fdc4b@schedmd.com> Many thanks to all the attendees, and especially to all those who presented at the Slurm User Group 2019 meeting in Salt Lake City. Thank you to the University of Utah as well for hosting. I hope to see many of you again at SLUG'20, which will be held at Harvard University on September 15-16, 2020.
PDFs of the presentations are online at http://slurm.schedmd.com/publications.html For those of you who will be at SC19 in Denver - we hope to see you at the Slurm booth (#1571), and at the Slurm "Birds of a Feather" session on Thursday, November 21st, from 12:15 - 1:15pm, in rooms 401/402/403/404. As always, there will be a number of presentations in the Slurm booth - please check the display in the booth for the full schedule. - Tim -- Tim Wickberg Chief Technology Officer, SchedMD LLC Commercial Slurm Development and Support From tim at schedmd.com Thu Nov 14 22:21:11 2019 From: tim at schedmd.com (Tim Wickberg) Date: Thu, 14 Nov 2019 15:21:11 -0700 Subject: [slurm-announce] Slurm version 19.05.4 is now available, SC19 Message-ID: <808a7048-b174-6915-705b-8975e9807f49@schedmd.com> Slurm version 19.05.4 is now available, and includes a series of fixes since 19.05.3 was released last month. Downloads are available at https://www.schedmd.com/downloads.php . Release notes follow below. For those of you who will be at SC19 in Denver: we hope to see you at the Slurm booth (#1571), and at the Slurm "Birds of a Feather" session on Thursday, November 21st, from 12:15 - 1:15pm, in rooms 401/402/403/404. As always, there will be a number of presentations at the Slurm booth - please check the display in the booth for the schedule. - Tim -- Tim Wickberg Chief Technology Officer, SchedMD LLC Commercial Slurm Development and Support > * Changes in Slurm 19.05.4 > ========================== > -- Don't allow empty string as a reservation name; generate a name if empty > string is provided. > -- Fix salloc segfault when using --no-shell option. > -- Fix divide by zero when normalizing partition priorities. > -- Restore ability to set JobPriorityFactor to 0 on a partition. > -- Fix multi-partition non-normalized job priorities. > -- Adjust precedence between --mem-per-cpu and --mem-per-node to enforce > them as mutually exclusive.
>    Specifying either on the command line will now explicitly override
>    any value inherited through the environment.
> -- Always print node's version, if it exists, in scontrol show nodes.
> -- sbatch - ensure SLURM_NTASKS_PER_NODE is exported when --ntasks-per-node
>    is set.
> -- slurmctld - fix memory leak when using DebugFlags=Reservation.
> -- Reset --mem and --mem-per-cpu options correctly when using --mem-per-gpu.
> -- Use correct function signature for step_set_env() in gres plugin interface.
> -- Restore pre-19.05 hostname handling behavior for AllocNodes by always
>    truncating to just the host portion and dropping any domain name portion
>    returned by gethostbyaddr().
> -- Fix abort initializing a configuration without acct_gather.conf.
> -- Fix GRES binding and CLOUD nodes GRES setup regressions.
> -- Make sview work with glib2 v2.62.
> -- Fix slurmctld abort when in developer mode and submitting to multiple
>    partitions with a bad QOS and not enforcing QOS.
> -- Enforce PART_NODES if only PartitionName is specified.
> -- Fix slurmd -G functionality.
> -- Fix build on 32-bit systems.
> -- Remove duplicate log entry on update job.
> -- sched/backfill - fix the estimated sched_nodes for multi-part jobs.
> -- slurm.spec - fix pmix_version global context macro.
> -- Fix cons_tres topology logic incorrectly evaluating insufficient resources.
> -- Fix job "--switches=count@time" option handling in cons_tres topology.
> -- scontrol - allow changes to the WorkDir for pending jobs.
> -- Enable coordinators to delete users if they only belong to accounts that
>    the coordinator is over.
> -- Fix regression on update from older versions with DefMemPerCPU.
> -- Fix issues with --gpu-bind while using cgroups.
> -- Suspend nodes after being down for SuspendTime.
> -- Fix rebooting nodes from skipping nextstate states on boot.
> -- Fix regression in reservation creation logic from 19.05.3 which would
>    incorrectly deny certain valid reservations from being created.
> -- slurmdbd - process sacct/sacctmgr job queries from older clients correctly.

From tim at schedmd.com Fri Dec 20 21:13:19 2019
From: tim at schedmd.com (Tim Wickberg)
Date: Fri, 20 Dec 2019 14:13:19 -0700
Subject: [slurm-announce] Slurm versions 19.05.5 and 18.08.9 are now available (CVE-2019-19727 and CVE-2019-19728)
Message-ID:

Slurm versions 19.05.5 and 18.08.9 are now available, and include a
series of recent bug fixes, as well as a fix for two moderate security
vulnerabilities discussed below.

SchedMD customers were informed on December 11th and provided a patch on
request; this process is documented in our security policy [1].

CVE-2019-19727:
Johannes Segitz from SUSE reported that slurmdbd.conf may be installed
with insecure permissions by certain Slurm packaging systems.

Slurm itself - as shipped by SchedMD - does not manage slurmdbd.conf
directly, but the slurmdbd.conf.example sets a poor example by
installing itself with 0644 permissions instead of 0600 in both the
slurm.spec and slurm.spec-legacy packaging scripts.

Sites are encouraged to verify that the slurmdbd.conf file - which
usually will contain your MySQL user and password - is secure on their
clusters. Note that this configuration file is only needed by the
slurmdbd primary (and optional backup) servers, and does not need to be
accessible throughout the cluster.

CVE-2019-19728:
Harald Barth from the KTH Royal Institute of Technology reported that
"srun --uid" may not always drop into the correct user account, and
instead will print a warning message but launch the tasks as root.

Note that "srun --uid" is only available to the root user, and that this
issue is only shown by a race condition between successive lookup calls
within the srun client command. SchedMD does not recommend use of the
"srun --uid" option (e.g., it does not load the target user's
environment but will export the root user's) and may remove this option
in a future release.
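A minimal sketch, in Python, of the kind of permission check suggested
for CVE-2019-19727 above. The helper names (insecure_permission_bits,
is_owner_only) are hypothetical and not part of Slurm, and the
demonstration runs against a throwaway temporary file rather than a real
slurmdbd.conf, whose location varies by site:

```python
import os
import stat
import tempfile


def insecure_permission_bits(path):
    """Return the group/other permission bits set on path (0 means owner-only)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO)


def is_owner_only(path):
    """True if neither group nor other has any access, as with mode 0600."""
    return insecure_permission_bits(path) == 0


# Demonstration on a temporary stand-in file (an assumption for illustration,
# not a real configuration file).
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "slurmdbd.conf")
    with open(demo, "w") as fh:
        fh.write("StoragePass=example\n")

    os.chmod(demo, 0o644)  # the insecure mode described in the advisory
    print(oct(insecure_permission_bits(demo)), is_owner_only(demo))

    os.chmod(demo, 0o600)  # the mode the fixed packaging installs with
    print(oct(insecure_permission_bits(demo)), is_owner_only(demo))
```

On a slurmdbd host, the same check could be pointed at the site's actual
slurmdbd.conf path; if is_owner_only() returns False, tightening the
file to 0600 addresses the exposure described above.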
Downloads are available at https://www.schedmd.com/downloads.php .

Release notes follow below.

- Tim

[1] https://www.schedmd.com/security.php

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

> * Changes in Slurm 19.05.5
> ==========================
> -- Fix both socket-[un]constrained GRES issues that would lead to incorrect
>    GRES allocations and GRES underflow errors at deallocation time.
> -- Reject unrunnable jobs submitted to reservations.
> -- Fix misleading error returned for immediate allocation requests when defer
>    in SchedulerParameters by decoupling defer from too fragmented logic.
> -- Fix printf format string error on FreeBSD.
> -- Fix parsing of delay_boot in controller when additional arguments follow it.
> -- Fix --ntasks-per-node in cons_tres.
> -- Fix array tasks getting same reject reason.
> -- Ignore DOWN/DRAIN partitions in reduce_completing_frag logic.
> -- Fix alloc_node validation when updating a job.
> -- Fix for requesting specific nodes when using cons_tres topology.
> -- Ensure x11 is setup before launching a job step.
> -- Fix incorrect SLURM_CLUSTER_NAME env var in batch step.
> -- Perl API - Fix undefined symbol for slurmdbd_pack_fini_msg.
> -- Install slurmdbd.conf.example with 0600 permissions to encourage secure
>    use. CVE-2019-19727.
> -- srun - do not continue with job launch if --uid fails. CVE-2019-19728.

> * Changes in Slurm 18.08.9
> ==========================
> -- Wrap END_TIMER{,2,3} macro definition in "do {} while (0)" block.
> -- Make sview work with glib2 v2.62.
> -- Make Slurm compile on linux after sys/sysctl.h was deprecated.
> -- Install slurmdbd.conf.example with 0600 permissions to encourage secure
>    use. CVE-2019-19727.
> -- srun - do not continue with job launch if --uid fails. CVE-2019-19728