[slurm-users] SLURM in K8s, any advice?

Nicolas Greneche nicolas.greneche at univ-paris13.fr
Mon Nov 14 12:00:15 UTC 2022


Hi Hans,

I work on this topic; it is my PhD subject. Here are some links:

Positioning paper, from the beginning of my thesis:

https://www.computer.org/csdl/proceedings-article/sbac-pad/2020/992400a281/1o8qfAgSll6

More up to date, a study on the containerization of three major HPC 
schedulers (including SLURM):

https://link.springer.com/chapter/10.1007/978-3-031-12597-3_13

And even more up to date (in fact, I'm presenting the paper at 9:15 AM 
at SC'22 today), an application to an autoscaling containerized OAR 
batch scheduler in the cloud (it should be easy to extend to SLURM):

https://sites.google.com/view/supercompcloud

To cut a long story short: you will have to create a Pod containing 
slurmctld and munge containers. Optionally, you may have a Pod with 
mysql and slurmdbd for accounting.
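
A rough sketch of what that Pod could look like (the image names and 
the munge-key Secret are hypothetical, adapt them to your own builds):

   apiVersion: v1
   kind: Pod
   metadata:
     name: slurmctld
   spec:
     volumes:
     - name: munge-socket            # shared socket dir for munged
       emptyDir: {}
     - name: munge-key
       secret:
         secretName: munge-key       # hypothetical Secret with munge.key
         defaultMode: 0400
     containers:
     - name: munge
       image: myrepo/munge:latest    # hypothetical image
       volumeMounts:
       - { name: munge-socket, mountPath: /run/munge }
       - { name: munge-key, mountPath: /etc/munge }
     - name: slurmctld
       image: myrepo/slurmctld:22.05 # hypothetical image
       volumeMounts:
       - { name: munge-socket, mountPath: /run/munge }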

Then you have Pods containing slurmd and munge containers. My advice is 
to use the configless mode for slurmd: it avoids having to distribute 
and synchronize slurm.conf across the nodes. The drawback is that you 
still have to configure munge, but it's a good trade-off: the munge key 
is stable over time, whereas slurm.conf can change (so you would have 
to redistribute it).
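
On the controller side, configless mode is enabled with one line in 
slurm.conf:

   SlurmctldParameters=enable_configless

and each slurmd then fetches its configuration from the controller 
instead of reading a local slurm.conf, e.g. in the compute Pod spec 
(the service name here is hypothetical):

   containers:
   - name: slurmd
     image: myrepo/slurmd:22.05     # hypothetical image
     args: ["-D", "--conf-server", "slurmctld-svc:6817"]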

Beware of the fact that slurmctld must be restarted for major topology 
modifications (this should have been fixed in more recent releases). My 
advice is to run slurmctld as a child of a supervise process (djb 
daemontools is perfect for this: https://cr.yp.to/daemontools.html). 
This way you can restart slurmctld without losing job state.
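
With daemontools this boils down to a service directory containing a 
two-line run script; supervise re-runs it whenever slurmctld exits 
(the install path may differ in your image):

   #!/bin/sh
   # /service/slurmctld/run -- supervise re-runs this on exit
   exec /usr/sbin/slurmctld -D

The -D flag keeps slurmctld in the foreground, which is what supervise 
expects.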

Be extra careful with network name resolution synchronization. There is 
a delay between a pod's creation and its name becoming resolvable. You 
may use initContainers to wait for resolution to succeed before 
starting the main container of the pod.
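
A minimal initContainer doing that could look like this (the hostname 
it waits for is hypothetical):

   initContainers:
   - name: wait-for-dns
     image: busybox:1.36
     command: ["sh", "-c",
               "until nslookup slurmctld-svc; do sleep 2; done"]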

Feel free to reach me if you want to go further into the details!

Best Regards,

On 14/11/2022 at 09:42, Viessmann Hans-Nikolai (PSI) wrote:
> Good Morning,
> 
> I'm working on a project at work to run SLURM cluster management components
> (slurmctld and slurmdbd) as K8s pods, which manage a cluster of physical compute
> nodes. I've come upon a few discussions of doing this (or more generally running
> SLURM in containers); I especially found this one
> (see https://groups.google.com/g/slurm-users/c/uevFWPHHr2U/m/fkwusc0JDwAJ)
> very helpful.
> 
> Are there any further details or advice anyone has on such a setup?
> 
> Thank you and kind regards,
> Hans
> 
> ---------------------------------------------------------------------------------------------
> Paul Scherrer Institut
> Hans-Nikolai Viessmann
> High Performance Computing & Emerging Technologies
> Building/Room: OHSA/D02
> Forschungsstrasse 111
> 5232 Villigen PSI
> Switzerland
> 
> Telephone: +41 56 310 41 24
> E-Mail: hans-nikolai.viessmann at psi.ch
> GPG: 46F7 826E 80E1 EE45 2DCA 1BFC A39B E4B6 EA0C E4C4


