I really struggle to see the point of k8s for large computational workloads. It adds a lot of complexity, and I don’t see what benefit it brings.
If you really want to run containerised workloads as batch jobs on AWS, for example, then it's a great deal simpler to do so with AWS Batch and ECS than with Kubernetes.
Creating a Batch queue and job definition in CDK can be done in a couple of dozen lines of code. See the example I wrote a year or so ago, recently updated now that AWS Batch has fully supported L2 constructs in CDK:
https://github.com/tcutts/cdk-batch-python/tree/main
It has a few more bells and whistles, such as triggering Batch job submissions as files arrive in an S3 bucket and closing the queue to jobs automatically if a budget threshold is exceeded, but it's still only about 200 lines of code.
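For illustration, a minimal CDK (Python) sketch of that kind of Batch setup might look like the following. This is an assumption-laden sketch, not code from the linked repo: the construct IDs, container image, command, and CPU/memory sizing are all illustrative, and it assumes aws-cdk-lib v2 with the stable L2 `aws_batch` constructs.

```python
# Hypothetical sketch of a Batch queue + job definition in CDK v2 (Python).
# All names and sizes below are illustrative, not from the repo above.
from aws_cdk import App, Stack, Size
from aws_cdk import aws_batch as batch, aws_ec2 as ec2, aws_ecs as ecs


class BatchDemoStack(Stack):
    def __init__(self, scope, construct_id, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        # A VPC for the compute environment to run in
        vpc = ec2.Vpc(self, "Vpc", max_azs=2)

        # Fargate-backed compute environment: no instances to manage
        compute_env = batch.FargateComputeEnvironment(
            self, "ComputeEnv", vpc=vpc
        )

        # Queue wired to the compute environment
        queue = batch.JobQueue(self, "JobQueue")
        queue.add_compute_environment(compute_env, 1)

        # Job definition: which container to run, with what resources
        batch.EcsJobDefinition(
            self, "JobDef",
            container=batch.EcsFargateContainerDefinition(
                self, "Container",
                image=ecs.ContainerImage.from_registry(
                    "public.ecr.aws/amazonlinux/amazonlinux:latest"
                ),
                cpu=1,
                memory=Size.mebibytes(2048),
                command=["echo", "hello from Batch"],
            ),
        )


app = App()
BatchDemoStack(app, "BatchDemo")
app.synth()
```

Deployed with `cdk deploy`, this yields a working queue you can submit jobs to; since it is an infrastructure definition rather than runnable application code, it only takes effect via `cdk synth`/`cdk deploy` against an AWS account.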
I really don’t understand what k8s would add to that sort of architecture. In fact, when AWS added support for EKS to AWS Batch, I asked the internal team what the point of that
was, and it was basically just “some customers insisted on it”. No-one could actually articulate for me what tangible benefit there was to it.
Tim
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca
Find out more about R&D IT Data, Analytics & AI and how we can support you by visiting our Service Catalogue.
From:
Sylvain MARET via slurm-users <slurm-users@lists.schedmd.com>
Date: Wednesday, 13 March 2024 at 10:29
To: Nicolas Greneche <nicolas.greneche@univ-paris13.fr>, slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Re: SLURM in K8s, any advice?
Hello,
I haven't played with Slurm in k8s, but I did attend this talk:
https://fosdem.org/2024/schedule/event/fosdem-2024-2590-kubernetes-and-hpc-bare-metal-bros/
which shows that at least someone was able to do it, and it may be worth talking to her about it. I wanted to ask her for the code to reproduce her experiment, but I haven't had the time yet.
Regards,
Sylvain Maret
On 13/03/2024 11:04, Nicolas Greneche via slurm-users wrote:
>
> Hi Alan,
>
> Your topic is indeed my PhD thesis (defended late November). It consists
> of building autoscaling HPC infrastructure in the cloud (from a compute
> node provisioning point of view). In this work I show that the default
> Kubernetes controllers are not well designed for autoscaling containerized
> HPC clusters [1], and I wrote a very basic K8s controller for OAR [2]
> (another scheduler, developed at INRIA). This controller deserves a
> rewrite; it's only a proof of concept ^^.
>
> My guess is that you have nesting issues with cgroup/v2 inside your
> containerized compute nodes (the ones that run slurmd)? If that's the
> case, maybe you can use Kata Containers [3] instead of CRI-O as the
> container engine [4] (I did this in 2021). The main asset of Kata
> Containers is that it can use KVM.
>
> The consequence is that your CPUs are "real" vCPUs and not a quota
> enforced by cgroups. I noticed that with Kata, my slurmd containers had
> cpuinfo / nproc values that reflected the limits enforced in my K8s
> manifest. I didn't go deep with Kata because I focused on the
> "controller" aspect of things. But using KVM may hide the cgroup
> nesting from slurmd?
>
> I hope this helps!
>
> Kind regards,
>
> [1] A methodology to scale containerized HPC infrastructures in the
> Cloud, Nicolas Greneche, Christophe Cérin and Tarek Menouer, at
> Euro-Par 2022
>
> [2] Autoscaling of Containerized HPC Clusters in the Cloud, Nicolas
> Greneche, Christophe Cérin, at SuperCompCloud: 6th Workshop on
> Interoperability of Supercomputing and Cloud Technologies (held in
> conjunction with SC'22)
>
> [3] https://katacontainers.io
>
> [4] https://github.com/kata-containers/documentation/blob/master/how-to/run-kata-with-k8s.md
>
>
> Le 13/03/2024 à 09:06, LEAVY Alan via slurm-users a écrit :
>> I’m a little late to this party but would love to establish contact with
>> others using slurm in Kubernetes.
>>
>> I recently joined a research institute in Vienna (IIASA) and I’m getting
>> to grips with slurm and Kubernetes (my previous role was data
>> engineering / fintech). My current setup sounds like what Urban
>> described in this thread, back in Nov 22. It has some rough edges
>> though.
>>
>> Right now, I’m trying to upgrade to slurm-23.11.4 in Ubuntu 23.10
>> containers. I’m having trouble with the cgroup/v2 plugin.
>>
>> Are you still using Slurm on K8s, Urban? How did your installation work
>> out, Hans? Would either of you be willing to share your experiences?
>>
>> Regards,
>>
>> Alan.
>
> --
> Nicolas Greneche
> USPN / DSI
> Support à la recherche / RSSI Suppléant
>
> https://www-magi.univ-paris13.fr
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-leave@lists.schedmd.com