Note: I’m aware that I can run Kube on a single node, but we need more resources. So ultimately we need a way to have Slurm and Kube coexist in the same cluster, both sharing the full set of resources and both fully aware of each other's resource usage.
This is something that we (SchedMD) are working on, although it's a bit earlier than I was planning to publicly announce anything...
This is a very high-level view, and I have to apologize for stalling a bit, but: we've hired a team to build out a collection of tools that we're calling "Slinky" [1]. These provide canonical ways of running Slurm within Kubernetes, ways of maintaining and managing cluster state, and scheduling integration that allows compute nodes to be available to both the Kubernetes and Slurm environments while coordinating their status.
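To make the general idea a bit more concrete, here's a minimal sketch of one piece of that picture: running the slurmd compute daemon on every Kubernetes node as a DaemonSet, using the official Kubernetes Python client. This is not Slinky itself and not any API we're shipping; the image name, labels, and namespace are purely hypothetical.

# Illustrative only: deploy a hypothetical slurmd container to every node
# as a DaemonSet, so Slurm's compute daemons live inside the Kubernetes cluster.
from kubernetes import client, config

def create_slurmd_daemonset(namespace: str = "slurm") -> None:
    # Load credentials from ~/.kube/config (use load_incluster_config() in-cluster).
    config.load_kube_config()

    container = client.V1Container(
        name="slurmd",
        image="example.com/slurm/slurmd:24.05",  # hypothetical image name
        security_context=client.V1SecurityContext(privileged=True),
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "slurmd"}),
        spec=client.V1PodSpec(
            host_network=True,  # slurmd must be reachable by slurmctld
            containers=[container],
        ),
    )
    daemonset = client.V1DaemonSet(
        api_version="apps/v1",
        kind="DaemonSet",
        metadata=client.V1ObjectMeta(name="slurmd", namespace=namespace),
        spec=client.V1DaemonSetSpec(
            selector=client.V1LabelSelector(match_labels={"app": "slurmd"}),
            template=template,
        ),
    )
    client.AppsV1Api().create_namespaced_daemon_set(namespace=namespace, body=daemonset)

if __name__ == "__main__":
    create_slurmd_daemonset()

The actual Slinky tooling handles the harder parts this sketch glosses over: maintaining cluster state and coordinating node status between the two schedulers.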
We'll be talking about it in more detail at the Slurm User Group Meeting in Oslo [3], then KubeCon North America in Salt Lake City, and SC'24 in Atlanta. We'll have the (open-source, Apache 2.0 licensed) code for our first development phase available by SC'24, if not sooner.
There's a placeholder documentation page [4] that points to some of the presentations I've previously given about approaches to tackling this converged-computing model, but I'll caution that they're a bit dated, and the Slinky-specific presentations we've been working on internally aren't publicly available yet.
If you're a SchedMD support customer with specific use cases, please feel free to ping your account manager if you'd like to chat at some point in the next few months.
- Tim
[1] Slinky is not an acronym (neither is Slurm [2]), but loosely stands for "Slurm in Kubernetes".
[2] https://slurm.schedmd.com/faq.html#acronym
[3] https://www.schedmd.com/about-schedmd/events/
[4] https://slurm.schedmd.com/slinky.html