[slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

Ryan Novosielski novosirj at rutgers.edu
Thu Feb 3 20:59:42 UTC 2022

> On Feb 3, 2022, at 2:55 PM, Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk> wrote:
> On 03-02-2022 16:37, Nathan Smith wrote:
>> Yes, we are running slurmdbd. We could arrange enough downtime to do an incremental upgrade of major versions as Brian Andrus suggested, at least on the slurmctld and slurmdbd systems. The slurmds I would just do a direct upgrade once the scheduler work was completed.
> As Brian Andrus said, you must upgrade Slurm by at most 2 major versions, and that includes slurmd's as well!  Don't do a "direct upgrade" of slurmd by more than 2 versions!
> I recommend separate physical servers for slurmdbd and slurmctld.  Then you can upgrade slurmdbd without taking the cluster offline.  It's OK for slurmdbd to be down for many hours, since slurmctld caches the state information in the meantime.

The one thing to watch out for here is that the caching on slurmctld uses memory (obvious, I guess, when you think about it). This matters more if you are on a VM than on a physical server, since you may have sized the RAM for how much slurmctld normally appears to need, as we did. If slurmdbd is down for a long time (we had someone hitting a bug that would start running jobs right after everyone went to sleep, for example), slurmctld can eventually run out of memory and crash, and then that cache is lost. You don't normally see memory being used like that, because slurmdbd is normally up and accepting the accounting data.
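
If you want to keep an eye on this while slurmdbd is down, one option is to watch the "DBD Agent queue size" line that sdiag prints on the slurmctld host. Below is a minimal sketch in Python, not something we run in production: the function names and the 5000 threshold are made up for illustration, and it assumes sdiag is on the PATH and prints that line in the usual "DBD Agent queue size: N" form.

```python
import re
import subprocess
from typing import Optional

# Matches the sdiag line that reports how many accounting messages
# slurmctld is holding for slurmdbd. Anchored so that the separate
# "Agent queue size:" line is not picked up by mistake.
_QUEUE_RE = re.compile(r"^\s*DBD Agent queue size:\s*(\d+)\s*$", re.MULTILINE)


def parse_dbd_queue_size(sdiag_output: str) -> Optional[int]:
    """Return the DBD agent queue size from sdiag text, or None if absent."""
    match = _QUEUE_RE.search(sdiag_output)
    return int(match.group(1)) if match else None


def dbd_queue_is_high(threshold: int = 5000) -> bool:
    """Run sdiag and report whether the queue has reached the threshold.

    The threshold is an arbitrary illustrative value; pick one that fits
    the memory available to your slurmctld.
    """
    result = subprocess.run(
        ["sdiag"], capture_output=True, text=True, check=True
    )
    size = parse_dbd_queue_size(result.stdout)
    if size is None:
        raise RuntimeError("no 'DBD Agent queue size' line in sdiag output")
    return size >= threshold
```

Calling dbd_queue_is_high() from cron or your monitoring system during a long slurmdbd outage would give some warning before memory gets tight. slurm.conf also has a MaxDBDMsgs setting that caps this queue; check the man page for your version before relying on it.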

|| \\UTGERS,  	 |---------------------------*O*---------------------------
||_// the State	 |         Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ	 | Office of Advanced Research Computing - MSB C630, Newark
