Hi Alan,
Your topic is indeed the subject of my PhD thesis (defended in late November). It deals with building autoscaling HPC infrastructure in the cloud, from the point of view of compute node provisioning. In this work I show that the default Kubernetes controllers are not well suited to autoscaling containerized HPC clusters [1], and I wrote a very basic K8s controller for OAR [2] (another scheduler, developed at INRIA). This controller deserves a rewrite; it is only a proof of concept ^^.
My guess is that you have nesting issues with cgroup/v2 inside your containerized compute nodes (the ones that run slurmd)? If that is the case, maybe you can use Kata Containers [3] instead of CRI-O as a container engine [4] (I did this in 2021). The main asset of Kata Containers is that it can use KVM.
The consequence is that your CPUs are "real" vCPUs and not a quota enforced by cgroups. I noticed that with Kata, /proc/cpuinfo and nproc in my slurmd containers reflected the limits set in my K8s manifest. I didn't go deep with Kata because I focused on the "controller" aspect of things. But using KVM may hide the cgroup nesting from slurmd?
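To check which situation a given slurmd container is in, one option is to compare the CPU count the process can see with the cgroup v2 quota, if there is one. The snippet below is only a rough sketch: it assumes the cgroup v2 hierarchy is mounted at /sys/fs/cgroup inside the container, and the function names are made up for illustration.

```python
import os
from pathlib import Path


def parse_cpu_max(text):
    """Convert the content of a cgroup v2 cpu.max file into a number of cores.

    The file holds "<quota> <period>" in microseconds, or "max <period>"
    when no quota is set. Returns None when there is no quota.
    """
    parts = text.split()
    quota = parts[0]
    # The kernel default period is 100000 us if it is not given.
    period = int(parts[1]) if len(parts) > 1 else 100000
    if quota == "max":
        return None
    return int(quota) / period


def cgroup_cpu_limit(path="/sys/fs/cgroup/cpu.max"):
    """Read the quota from the given path (assumed mount point); None if absent."""
    p = Path(path)
    if not p.exists():
        return None
    return parse_cpu_max(p.read_text())


def visible_cpus():
    """Number of CPUs this process may run on (roughly what nproc reports)."""
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # sched_getaffinity is not available on every platform.
        return os.cpu_count()


if __name__ == "__main__":
    print("CPUs visible to this process:", visible_cpus())
    print("cgroup v2 cpu.max limit (cores):", cgroup_cpu_limit())
```

If the visible CPU count matches the limit in the manifest and no smaller quota shows up, the limit is probably being applied through the VM's vCPUs; a host-sized CPU count together with a smaller quota points to cgroup enforcement.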
I hope this helps!
Kind regards,
[1] A methodology to scale containerized HPC infrastructures in the Cloud, Nicolas Greneche, Christophe Cérin and Tarek Menouer, at Euro-Par 2022
[2] Autoscaling of Containerized HPC Clusters in the Cloud, Nicolas Greneche, Christophe Cérin, at SuperCompCloud: 6th Workshop on Interoperability of Supercomputing and Cloud Technologies (held in conjunction with SC'22)
[4] https://github.com/kata-containers/documentation/blob/master/how-to/run-kata...
On 13/03/2024 at 09:06, LEAVY Alan via slurm-users wrote:
I’m a little late to this party but would love to establish contact with others using slurm in Kubernetes.
I recently joined a research institute in Vienna (IIASA) and I’m getting to grips with slurm and Kubernetes (my previous role was data engineering / fintech). My current setup sounds like what Urban described in this thread, back in Nov 22. It has some rough edges though.
Right now, I’m trying to upgrade to slurm-23.11.4 in Ubuntu 23.10 containers. I’m having trouble with the cgroup/v2 plugin.
Are you still using slurm on K8s Urban? How did your installation work out Hans? Would either of you be willing to share your experiences?
Regards,
Alan.