Hello,
I haven't played with Slurm in K8s, but I did attend this talk: https://fosdem.org/2024/schedule/event/fosdem-2024-2590-kubernetes-and-hpc-b...
It shows that at least one person got this working, so it may be worth talking to her about it. I wanted to ask her for the code to reproduce her experiment, but I haven't had the time yet.
Regards, Sylvain Maret
On 13/03/2024 11:04, Nicolas Greneche via slurm-users wrote:
Hi Alan,
Your topic is indeed my PhD thesis (defended in late November). It is about building autoscaling HPC infrastructure in the cloud, from a compute node provisioning point of view. In this work I show that the default Kubernetes controllers are not well designed for autoscaling containerized HPC clusters [1], and I wrote a very basic K8s controller for OAR [2] (another scheduler, developed at INRIA). This controller deserves a rewrite; it's only a proof of concept ^^.
My guess is that you have nesting issues with cgroup/v2 inside your containerized compute nodes (the ones that run slurmd)? If that's the case, maybe you can use katacontainers [3] instead of CRI-O as a container engine [4] (I did this in 2021). The main asset of katacontainers is that it can use KVM.
The consequence is that your CPUs are "real" vCPUs and not a quota enforced by cgroups. I noticed that with kata, the cpuinfo / nproc output in my slurmd containers reflected the limits set in my K8s manifest. I didn't go deep with kata because I focused on the "controller" aspect of things. But using KVM may hide the nesting of cgroups from slurmd?
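To see the difference between a cgroup quota and "real" vCPUs from inside a container, here is a minimal sketch (not something from my thesis code). It assumes cgroup v2 is mounted at /sys/fs/cgroup and that the CPU limit shows up in the cpu.max file; the function names parse_cpu_max and describe_cpus are made up for this example.

```python
import os
from typing import Optional


def parse_cpu_max(text: str) -> Optional[float]:
    """Convert the content of a cgroup v2 cpu.max file into a core count.

    The file holds "<quota> <period>" in microseconds, or "max <period>"
    when no quota is set. Returns None when the quota is unlimited.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def describe_cpus(cpu_max_path: str = "/sys/fs/cgroup/cpu.max") -> str:
    """Compare the CPUs the OS reports with the cgroup quota, if any."""
    visible = os.cpu_count()
    try:
        with open(cpu_max_path) as f:
            limit = parse_cpu_max(f.read())
    except OSError:
        # Assumed path is missing: cgroup v1, or the cpu controller is not enabled.
        return f"{visible} CPUs visible, no cgroup v2 cpu.max found"
    if limit is None:
        return f"{visible} CPUs visible, no cgroup quota"
    return f"{visible} CPUs visible, cgroup quota of {limit:g} cores"


if __name__ == "__main__":
    print(describe_cpus())
```

In a regular container with a CPU limit I would expect the visible CPU count to stay at the host value while the quota shows the limit; with kata, as described above, the visible count itself should match the manifest.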
I hope this helps!
Kind regards,
[1] Nicolas Greneche, Christophe Cérin and Tarek Menouer, "A methodology to scale containerized HPC infrastructures in the Cloud", Euro-Par 2022
[2] Nicolas Greneche and Christophe Cérin, "Autoscaling of Containerized HPC Clusters in the Cloud", SuperCompCloud: 6th Workshop on Interoperability of Supercomputing and Cloud Technologies (held in conjunction with SC'22)
[4] https://github.com/kata-containers/documentation/blob/master/how-to/run-kata...
On 13/03/2024 at 09:06, LEAVY Alan via slurm-users wrote:
I’m a little late to this party but would love to establish contact with others using slurm in Kubernetes.
I recently joined a research institute in Vienna (IIASA) and I’m getting to grips with slurm and Kubernetes (my previous role was data engineering / fintech). My current setup sounds like what Urban described in this thread, back in Nov 22. It has some rough edges though.
Right now, I’m trying to upgrade to slurm-23.11.4 in Ubuntu 23.10 containers. I’m having trouble with the cgroup/v2 plugin.
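For anyone who wants to check which cgroup hierarchy a container actually sees before digging into the plugin, here is a minimal sketch. It assumes the standard /proc/mounts layout (device, mount point, filesystem type, ...), where cgroup v2 appears with type cgroup2 and v1 with type cgroup; the function name cgroup_layout is made up for this example.

```python
def cgroup_layout(mounts_text: str) -> str:
    """Classify the cgroup setup from the text of /proc/mounts.

    Returns "v2", "v1", "hybrid" (both mounted) or "none".
    """
    fstypes = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            fstypes.add(fields[2])  # third field is the filesystem type
    has_v1 = "cgroup" in fstypes
    has_v2 = "cgroup2" in fstypes
    if has_v1 and has_v2:
        return "hybrid"
    if has_v2:
        return "v2"
    if has_v1:
        return "v1"
    return "none"


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        print(cgroup_layout(f.read()))
```

Run inside the slurmd container, this only reports what is mounted; it says nothing about whether the controllers are delegated to the container.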
Are you still using slurm on K8s, Urban? How did your installation work out, Hans? Would either of you be willing to share your experiences?
Regards,
Alan.
-- Nicolas Greneche USPN / DSI Support à la recherche / RSSI Suppléant https://www-magi.univ-paris13.fr