Convergence of Kube and Slurm?


Dan Healy

4 May 2024

9:05 p.m.

Bright Cluster Manager's marketing site says it can manage a cluster running both Kubernetes and Slurm, though maybe I misunderstood it. Either way, I am more and more often encountering groups that want to run a stack of containers that need private container networking.

What’s the current state of using the same HPC cluster for both Slurm and Kube?

Note: I’m aware that I can run Kube on a single node, but we need more resources. So ultimately we need a way to have Slurm and Kube exist in the same cluster, both sharing the full amount of resources and both being fully aware of resource usage.

Thanks,

Daniel Healy

Daniel Letai

6 May

10:54 a.m.

Tim Wickberg

7 May

12:47 a.m.

...

Note: I’m aware that I can run Kube on a single node, but we need more resources. So ultimately we need a way to have Slurm and Kube exist in the same cluster, both sharing the full amount of resources and both being fully aware of resource usage.

This is something that we (SchedMD) are working on, although it's a bit earlier than I was planning to publicly announce anything...

This is a very high-level view, and I have to apologize for stalling a bit, but: we've hired a team to build out a collection of tools that we're calling "Slinky" [1]. These provide for canonical ways of running Slurm within Kubernetes, ways of maintaining and managing the cluster state, and scheduling integration to allow for compute nodes to be available to both Kubernetes and Slurm environments while coordinating their status.

We'll be talking about it in more detail at the Slurm User Group Meeting in Oslo [3], then KubeCon North America in Salt Lake, and SC'24 in Atlanta. We'll have the (open-source, Apache 2.0 licensed) code for our first development phase available by SC'24 if not sooner.

There's a placeholder documentation page [4] that points to some of the presentations I've given before talking about approaches to tackling this converged-computing model, but I'll caution they're a bit dated and the Slinky-specific presentations we've been working on internally aren't publicly available yet.

If there are SchedMD support customers that have specific use cases, please feel free to ping your account managers if you'd like to chat at some point in the next few months.

- Tim

[1] Slinky is not an acronym (neither is Slurm [2]), but loosely stands for "Slurm in Kubernetes".

[2] https://slurm.schedmd.com/faq.html#acronym

[3] https://www.schedmd.com/about-schedmd/events/

[4] https://slurm.schedmd.com/slinky.html

-- Tim Wickberg Chief Technology Officer, SchedMD LLC Commercial Slurm Development and Support

Bjørn-Helge Mevik

6:26 a.m.

Tim Wickberg via slurm-users <slurm-users@lists.schedmd.com> writes:

...

[1] Slinky is not an acronym (neither is Slurm [2]), but loosely stands for "Slurm in Kubernetes".

And not at all inspired by Slinky Dog in Toy Story, I guess. :D

-- Cheers, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

wdennis＠nec-labs.com

29 Jul

6:42 p.m.

Can I ask if this replaces the work on "SUNK" that was previously announced (but never released as open source on GitHub as planned; it looks like it is only available on CoreWeave Cloud)?
