[slurm-users] Slurm Upgrade Philosophy?

Paul Edmon pedmon at cfa.harvard.edu
Thu Dec 24 14:24:43 UTC 2020


We are the same way, though we tend to keep pace with minor releases.  
We typically wait until the .1 release of a new major release before 
considering an upgrade so that many of the bugs are worked out.  We then 
have a test cluster that we install the release on and run a few test 
jobs to make sure things are working, usually MPI jobs as they tend to 
hit most of the features of the scheduler.
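
For illustration, a minimal smoke test of that sort looks roughly like 
the script below (the partition name, module name and binary path are 
placeholders rather than our actual setup):

#!/bin/bash
#SBATCH -J upgrade-smoke-test
#SBATCH -p test                  # placeholder partition name
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
#SBATCH -t 00:10:00

# Load whatever MPI stack the cluster provides (placeholder module name)
module load openmpi

# A simple MPI hello-world exercises node allocation, task launch and
# the PMI plumbing between Slurm and the MPI library.
srun ./mpi_hello_world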

We also like to stay current with releases as there are new features we 
want, or features we didn't know we wanted until our users find them and 
start using them.  So our general methodology is to upgrade to the 
latest minor release at our next monthly maintenance.  For major 
releases we will upgrade at our next monthly maintenance after the .1 
release is out, unless there is a show-stopping bug that we run into in 
our own testing, at which point we file a bug with SchedMD and get a 
patch.
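
In practice the upgrade itself at that maintenance follows the usual 
SchedMD-documented order: slurmdbd first, then slurmctld, then the 
slurmds on the compute nodes.  Very roughly, and assuming an RPM-based 
install with systemd (the package names, node range and parallel shell 
here are placeholders, not our actual tooling):

# Back up the accounting database and StateSaveLocation before starting.

# 1. slurmdbd first -- it performs any database schema conversion
systemctl stop slurmdbd
yum upgrade 'slurm*'             # or the site's package manager of choice
systemctl start slurmdbd

# 2. then the controller
systemctl restart slurmctld

# 3. then the compute node daemons, in batches
pdsh -w 'node[01-16]' 'yum upgrade -y slurm slurm-slurmd && systemctl restart slurmd'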

-Paul Edmon-

On 12/24/2020 1:57 AM, Chris Samuel wrote:
> On Friday, 18 December 2020 10:10:19 AM PST Jason Simms wrote:
>
>> Thanks to several helpful members on this list, I think I have a much better
>> handle on how to upgrade Slurm. Now my question is, do most of you upgrade
>> with each major release?
> We do, though not immediately and not without a degree of testing on our test
> systems.  One of the big reasons for us upgrading is that we've usually paid
> for features in Slurm for our needs (for example in 20.11 that includes
> scrontab so users won't be tied to favourite login nodes, as well as the
> experimental RPC queue code due to the large numbers of RPCs our systems need
> to cope with).
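>
> (For anyone who hasn't met scrontab yet: it's a crontab-style file, edited
> with "scrontab -e", whose entries run as scheduled Slurm jobs rather than
> being tied to cron on one particular login node.  A sketch of an entry,
> with a made-up partition name and script path, might look like:
>
> #SCRON -p short
> #SCRON -t 10:00
> 0 * * * * /home/someuser/bin/hourly_housekeeping.sh
>
> i.e. sbatch-style options on the #SCRON lines, followed by an ordinary
> cron time specification and the command to run.)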
>
> I also keep an eye out for discussions of what other sites find with new
> releases too, so I'm following the current concerns about 20.11 and the change
> in behaviour for job steps that do (expanding NVIDIA's example slightly):
>
> #SBATCH --exclusive
> #SBATCH -N2
> srun --ntasks-per-node=1 python multi_node_launch.py
>
> which (if I'm reading the bugs correctly) fails in 20.11 as that srun no
> longer gets all the allocated resources, and instead just gets the default
> of --cpus-per-task=1, which also affects things like mpirun in OpenMPI built
> with Slurm support (as it effectively calls "srun orted" and that "orted"
> launches the MPI ranks, so in 20.11 it only has access to a single core for
> them all to fight over).  Again - if I'm interpreting the bugs correctly!
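>
> If it does bite us that way, the obvious interim workaround would seem to
> be spelling out the per-task CPU count explicitly rather than relying on
> the step inheriting the whole allocation - something like the following,
> untested on our side and assuming homogeneous nodes:
>
> #SBATCH --exclusive
> #SBATCH -N2
> # Request every core on the node explicitly; SLURM_CPUS_ON_NODE is set by
> # Slurm in the batch environment to the CPU count of the node running the
> # batch script.
> srun --ntasks-per-node=1 --cpus-per-task=$SLURM_CPUS_ON_NODE python multi_node_launch.py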
>
> I don't currently have a test system that's free to try 20.11 on, but
> hopefully early in the new year I'll be able to test this out to see how much
> of an impact this is going to have and how we will manage it.
>
> https://bugs.schedmd.com/show_bug.cgi?id=10383
> https://bugs.schedmd.com/show_bug.cgi?id=10489
>
> All the best,
> Chris


