Hi Ron,
I am also in the "paranoid" group, and I've always done updates with jobs "live". Depending on the size of your userbase, you may want to consider pausing the submission/start of new jobs while you execute the dnf commands (yes, I use them rather than "raw" rpm, because I think they are less error-prone, e.g. with dependencies).
Since you are in the same group as myself, you can save a list of running jobs before and after executing the dnf commands, and see if they match. If they do, congratulations, everything went well. If they don't, there is a (tiny) risk that the jobs which completed during that time might miss "something". Examine their logs and/or warn the users as appropriate. To be clear, this tiny risk is about jobs that would complete *on their own* during that timeframe, not that the slurm update will cause healthy jobs to crash. What could happen is a race condition between a job terminating and the slurm update, which might try to update some information in some DB in an inconsistent way. My understanding is that the job itself (e.g. its output files) is safe; it's just the slurm records which might get into some trouble.
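For the before/after comparison, here is a minimal Python sketch of how I would do it. It assumes squeue is on the PATH and that "squeue -h -t RUNNING -o %i" (no header, running jobs only, job ID only) is a good enough snapshot for your site. The function names are mine and purely illustrative, not anything slurm provides.

```python
import subprocess

def parse_job_ids(squeue_output):
    """Turn one-job-ID-per-line text into a set, ignoring blank lines."""
    return {line.strip() for line in squeue_output.splitlines() if line.strip()}

def running_job_ids():
    """Snapshot the IDs of currently running jobs (assumes squeue is on PATH)."""
    # -h: no header, -t RUNNING: only running jobs, -o %i: print just the job ID
    result = subprocess.run(
        ["squeue", "-h", "-t", "RUNNING", "-o", "%i"],
        check=True, capture_output=True, text=True,
    )
    return parse_job_ids(result.stdout)

def jobs_finished_during_update(before, after):
    """Jobs that were running before the update but are not running after it."""
    return before - after

# Intended use around the update (not executed here):
#   before = running_job_ids()
#   ... run the dnf commands ...
#   after = running_job_ids()
#   print(sorted(jobs_finished_during_update(before, after)))
```

An empty result means no job ended while the packages were being replaced. Anything listed is a job whose accounting records are worth a look. Jobs that only appear in the "after" snapshot are new starts and are not part of the risk described above.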
You mention "waiting at least a week" between subsequent updates, but really the key point is this:
Before considering the upgrade complete, wait for all jobs that were already running to finish.
Which means: if you have a 6h wallclock limit, you only need to wait 7h. If you have a 2-month wallclock limit, you need to wait a bit more than 2 months. If you don't have a wallclock limit... you may have to wait forever... Wait! You have the list of jobs, because you are paranoid like myself and made one as mentioned above, so you "only" have to wait for all of them to complete before proceeding, not "forever".
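The "wait only for the jobs on the saved list" idea can be sketched in Python as below. This is only an illustration under my own assumptions: list_active_ids is a hypothetical callable you supply (for instance a wrapper around "squeue -h -o %i") that returns the set of job IDs the scheduler still knows as queued or running, and the 10-minute poll interval is arbitrary.

```python
import time

def remaining_jobs(saved_ids, active_ids):
    """Jobs from the saved pre-upgrade list that the scheduler still reports."""
    return set(saved_ids) & set(active_ids)

def wait_for_saved_jobs(saved_ids, list_active_ids, poll_seconds=600, sleep=time.sleep):
    """Block until none of the saved jobs is still active.

    list_active_ids is a caller-supplied (hypothetical) function returning
    the currently active job IDs. Returns the number of polls performed.
    """
    polls = 0
    while True:
        polls += 1
        if not remaining_jobs(saved_ids, list_active_ids()):
            return polls
        # Some saved jobs are still around: wait, then check again.
        sleep(poll_seconds)
```

Jobs submitted after the snapshot are ignored on purpose, so the loop ends once the last pre-upgrade job is gone, however long the wallclock limit is.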
With these precautions, you most likely won't encounter any issue (of course, that has to be weighed against the size of the cluster: if you have a huge one with hundreds of thousands of users/jobs/nodes, you will see things that have a 0.001% chance of happening and that most of us never encounter).
HTH.
Thank you for that guidance. I am certainly in the "overly cautious" and "paranoid" groups.
I will probably go through the slower upgrade process (1-8 list), with at least a week between them.
And yes, if anyone has experience doing such a vault between versions, please chime in.
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com