I really struggle to see the point of k8s for large computational workloads.  It adds a lot of complexity, and I don’t see what benefit it brings.

 

If you really want to run containerised workloads as batch jobs on AWS, for example, it’s a great deal simpler to use AWS Batch and ECS than to do all of that with Kubernetes.

 

Creating a Batch queue and job definition in CDK can be done in a couple of dozen lines of code.  See the example I wrote a year or so ago, recently updated now that AWS Batch has fully supported L2 constructs in CDK:  https://github.com/tcutts/cdk-batch-python/tree/main  It has a few more bells and whistles, such as triggering Batch job submissions as files arrive in an S3 bucket and automatically closing the queue to jobs if a budget threshold is exceeded, but it’s still only about 200 lines of code.
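To give a flavour of what I mean, the core of such a stack boils down to something like the sketch below, using the stable aws_cdk.aws_batch L2 constructs.  This is a minimal illustration, not the code from the repo above; the container image, CPU/memory sizing and command are placeholders:

```python
from aws_cdk import App, Stack, Size
from aws_cdk import aws_batch as batch
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_ecs as ecs

app = App()
stack = Stack(app, "BatchStack")

# Network for the compute environment
vpc = ec2.Vpc(stack, "Vpc", max_azs=2)

# Serverless (Fargate) compute environment attached to a job queue
compute_env = batch.FargateComputeEnvironment(stack, "ComputeEnv", vpc=vpc)
queue = batch.JobQueue(stack, "JobQueue")
queue.add_compute_environment(compute_env, 1)

# Job definition: the container image plus per-job CPU and memory
job_def = batch.EcsJobDefinition(
    stack,
    "JobDef",
    container=batch.EcsFargateContainerDefinition(
        stack,
        "Container",
        image=ecs.ContainerImage.from_registry(
            "public.ecr.aws/docker/library/busybox:latest"
        ),
        cpu=1,
        memory=Size.mebibytes(2048),
        command=["echo", "hello"],
    ),
)

app.synth()
```

That really is the whole moving-parts list: a compute environment, a queue, and a job definition.  Everything else (scaling, placement, retries) is handled by the Batch service itself.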

 

I really don’t understand what k8s would add to that sort of architecture.  In fact, when AWS added support for EKS to AWS Batch, I asked the internal team what the point of that was, and it was basically just “some customers insisted on it”.  No-one could actually articulate for me what tangible benefit there was to it.

 

Tim

 

-- 

Tim Cutts

Scientific Computing Platform Lead

AstraZeneca

 


 

 

From: Sylvain MARET via slurm-users <slurm-users@lists.schedmd.com>
Date: Wednesday, 13 March 2024 at 10:29
To: Nicolas Greneche <nicolas.greneche@univ-paris13.fr>, slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Re: SLURM in K8s, any advice?

Hello,

I haven't played with Slurm in k8s, but I did attend this talk:
https://fosdem.org/2024/schedule/event/fosdem-2024-2590-kubernetes-and-hpc-bare-metal-bros/

It shows that at least one person was able to do it, and it may be worth
talking to her about. I wanted to ask her for the code to reproduce her
experiment, but I haven't had the time to do so yet.

Regards,
Sylvain Maret

On 13/03/2024 11:04, Nicolas Greneche via slurm-users wrote:
>
> Hi Alan,
>
> Your topic is indeed my PhD thesis (defended last November). It consists
> of building autoscaling HPC infrastructure in the cloud (from a compute
> node provisioning point of view). In this work I show that Kubernetes'
> default controllers are not well suited to autoscaling containerized
> HPC clusters [1], and I wrote a very basic K8s controller for OAR [2]
> (another scheduler, developed at INRIA). That controller deserves a
> rewrite; it's only a proof of concept ^^.
>
> My guess is that you have nesting issues with cgroup/v2 inside your
> containerized compute nodes (the ones that run slurmd)? If that's the
> case, maybe you can use Kata Containers [3] instead of CRI-O as the
> container engine [4] (I did this in 2021). The main asset of Kata
> Containers is that it can use KVM.
>
> The consequence is that your CPUs are "real" vCPUs rather than a quota
> enforced by cgroups. I noticed that with Kata, my slurmd containers had
> a cpuinfo / nproc that reflected the limits enforced in my K8s
> manifest. I didn't go deep into Kata because I focused on the
> "controller" aspect of things. But using KVM may hide the cgroup
> nesting from slurmd?
>
> I hope this helps!
>
> Kind regards,
>
> [1] "A methodology to scale containerized HPC infrastructures in the
> Cloud", Nicolas Greneche, Christophe Cérin and Tarek Menouer,
> Euro-Par 2022
>
> [2] "Autoscaling of Containerized HPC Clusters in the Cloud", Nicolas
> Greneche, Christophe Cérin, SuperCompCloud: 6th Workshop on
> Interoperability of Supercomputing and Cloud Technologies (held in
> conjunction with SC'22)
>
> [3] https://katacontainers.io
>
> [4]
> https://github.com/kata-containers/documentation/blob/master/how-to/run-kata-with-k8s.md
>
>
> Le 13/03/2024 à 09:06, LEAVY Alan via slurm-users a écrit :
>> I’m a little late to this party but would love to establish contact with
>> others using slurm in Kubernetes.
>>
>> I recently joined a research institute in Vienna (IIASA) and I’m getting
>> to grips with slurm and Kubernetes (my previous role was data
>> engineering / fintech). My current setup sounds like what Urban
>> described in this thread, back in Nov 22. It has some rough edges
>> though.
>>
>> Right now, I’m trying to upgrade to slurm-23.11.4 in Ubuntu 23.10
>> containers. I’m having trouble with the cgroup/v2 plugin.
>>
>> Are you still using slurm on K8s Urban? How did your installation work
>> out Hans?
>> Would either of you be willing to share your experiences?
>>
>> Regards,
>>
>>      Alan.
>>
>>
>>
>
> --
> Nicolas Greneche
> USPN / DSI
> Support à la recherche / RSSI Suppléant
> https://www-magi.univ-paris13.fr
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-leave@lists.schedmd.com


