On 03-02-2022 21:59, Ryan Novosielski wrote:
>> On Feb 3, 2022, at 2:55 PM, Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk> wrote:
>> On 03-02-2022 16:37, Nathan Smith wrote:
>>> Yes, we are running slurmdbd. We could arrange enough downtime to do an incremental upgrade of major versions as Brian Andrus suggested, at least on the slurmctld and slurmdbd systems. The slurmds I would just do a direct upgrade once the scheduler work was completed.
>> As Brian Andrus said, you must upgrade Slurm by at most 2 major versions, and that includes slurmd's as well!  Don't do a "direct upgrade" of slurmd by more than 2 versions!
>> I recommend separate physical servers for slurmdbd and slurmctld.  Then you can upgrade slurmdbd without taking the cluster offline.  It's OK for slurmdbd to be down for many hours, since slurmctld caches the state information in the meantime.
> One thing to watch out for here, maybe more so if you are using a VM than a physical server, since you may have sized the RAM for how much slurmctld normally appears to need (as we did): the caching that takes place on the slurmctld uses memory (obvious, when you think about it). As a result, if slurmdbd is down for a long time (we had someone hitting a bug that would start running jobs right after everyone went to sleep, for example), your slurmctld can eventually run out of memory and crash, and then that cache is lost. You don’t normally see memory being used like that, because slurmdbd is normally up and accepting the accounting data.

The slurmctld caches job state information in:
# scontrol show config | grep StateSaveLocation
StateSaveLocation       = /var/spool/slurmctld

The StateSaveLocation should retain job information even if slurmctld 
crashes (at least the data which have been committed to disk).

The StateSaveLocation file system must not fill up, of course!  There 
are also some upper limits to the number of records in 
StateSaveLocation, but I can't find the numbers right now.
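
To keep an eye on that file system, something along these lines might help. It is only a sketch: the helper name state_save_location is made up for illustration, and it assumes that scontrol is in the PATH on the slurmctld host and prints the setting in the "Key = value" form shown above.

```shell
# Hypothetical helper: read `scontrol show config` output on stdin
# and print the configured StateSaveLocation path.
state_save_location() {
    awk -F'= *' '/^StateSaveLocation/ {print $2; exit}'
}

# Usage on the slurmctld host (assumes scontrol is available):
#   dir=$(scontrol show config | state_save_location)
#   df -h "$dir"     # free space on the file system holding the state
#   du -sh "$dir"    # current size of the saved state
```

Running the df check regularly (from cron or a monitoring system) while slurmdbd is down should give some warning before the file system fills up.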


