Bright Cluster Manager has some verbiage on their marketing site claiming that they can manage a cluster running both Kubernetes and Slurm. Maybe I misunderstood it, but nevertheless, I am increasingly encountering groups that want to run a stack of containers that need private container networking.
What’s the current state of using the same HPC cluster for both Slurm and Kube?
Note: I’m aware that I can run Kube on a single node, but we need more resources. So ultimately we need a way to have Slurm and Kube exist in the same cluster, both sharing the full amount of resources and both being fully aware of resource usage.
Thanks,
Daniel Healy
> Note: I’m aware that I can run Kube on a single node, but we need more resources. So ultimately we need a way to have Slurm and Kube exist in the same cluster, both sharing the full amount of resources and both being fully aware of resource usage.
This is something that we (SchedMD) are working on, although it's a bit earlier than I was planning to publicly announce anything...
This is a very high-level view, and I have to apologize for stalling a bit, but: we've hired a team to build out a collection of tools that we're calling "Slinky" [1]. These provide canonical ways of running Slurm within Kubernetes, ways of maintaining and managing the cluster state, and scheduling integration that allows compute nodes to be available to both the Kubernetes and Slurm environments while coordinating their status.
We'll be talking about it in more detail at the Slurm User Group Meeting in Oslo [3], then at KubeCon North America in Salt Lake City, and at SC'24 in Atlanta. We'll have the (open-source, Apache 2.0 licensed) code for our first development phase available by SC'24, if not sooner.
There's a placeholder documentation page [4] that points to some of the presentations I've given on approaches to tackling this converged-computing model, but I'll caution that they're a bit dated, and the Slinky-specific presentations we've been working on internally aren't publicly available yet.
If you're a SchedMD support customer with specific use cases, please feel free to ping your account manager if you'd like to chat at some point in the next few months.
- Tim
[1] Slinky is not an acronym (neither is Slurm [2]), but loosely stands for "Slurm in Kubernetes".
[2] https://slurm.schedmd.com/faq.html#acronym
[3] https://www.schedmd.com/about-schedmd/events/
[4] https://slurm.schedmd.com/slinky.html
Tim Wickberg via slurm-users <slurm-users@lists.schedmd.com> writes:
> [1] Slinky is not an acronym (neither is Slurm [2]), but loosely stands for "Slurm in Kubernetes".
And not at all inspired by Slinky Dog in Toy Story, I guess. :D
Can I ask whether this replaces the previously announced work on "SUNK"? (It was never released as open source on GitHub as planned; it looks like it is only available on CoreWeave Cloud?)