[slurm-users] Running Containerized Slurmctld and Slurmdb in Production?
mhanby at uab.edu
Wed Mar 15 16:48:39 UTC 2023
FYI, after more internet sleuthing (searching for “juju slurm”) I came across this outstanding-looking project: Omnivector Slurm Distribution (OSD): https://omnivector-solutions.github.io/osd-documentation/master/index.html
This project uses Juju (a Canonical project) to deploy, configure and manage a Slurm cluster along with a variety of other components, such as the Slurm REST API, Prometheus integration, log forwarding via Fluentbit to Graylog, and others.
Deployment targets include cloud (AWS/OpenStack), local LXD, MAAS for bare metal…
I’ve only started to play with OSD, but it looks like a great framework for deploying Slurm clusters.
Quick install on an Ubuntu 22.04 LTS host:
sudo snap install juju --classic
sudo snap install lxd
lxd init --auto
lxc network set lxdbr0 ipv6.address none
sudo ufw allow 8443/tcp
juju bootstrap --show-log localhost
Followed by a quick test of sinfo:
juju run --unit slurmctld/0 "sinfo"
PARTITION   AVAIL  TIMELIMIT  NODES  STATE  NODELIST
osd-slurmd  up     infinite   1      down*  juju-65df3d-2

juju run --unit slurmctld/0 "sinfo -R"
REASON    USER   TIMESTAMP            NODELIST
New node  slurm  2023-03-15T01:21:21  juju-65df3d-2
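
The sinfo output above shows the node in state down* with the reason "New node". As a rough sketch (not something taken from the OSD documentation), the snippet below pulls the names of down nodes out of default-format sinfo output and prints the standard Slurm "scontrol update ... state=resume" command for each one, without running it. The function name down_nodes and the embedded sample text are made up for illustration, and it is an assumption that resuming the node by hand is appropriate here; OSD may have its own way of bringing a new node into service.

```shell
#!/bin/sh
# Hypothetical helper: read default-format `sinfo` output on stdin and
# print the NODELIST field of every row whose STATE starts with "down".
down_nodes() {
    # Skip the header row; column 5 is STATE, column 6 is NODELIST.
    awk 'NR > 1 && $5 ~ /^down/ { print $6 }'
}

# Sample text copied from the sinfo output in this message.
sample='PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
osd-slurmd up infinite 1 down* juju-65df3d-2'

# Only print the commands; nothing is executed against the cluster.
printf '%s\n' "$sample" | down_nodes | while read -r node; do
    echo "scontrol update nodename=$node state=resume"
done
```

On a live deployment the sample text would be replaced by the real output, e.g. by piping juju run --unit slurmctld/0 "sinfo" into down_nodes.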
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Hanby, Mike <mhanby at uab.edu>
Date: Wednesday, February 15, 2023 at 1:51 PM
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: [slurm-users] Running Containerized Slurmctld and Slurmdb in Production?
Just wondering if any sites are running containerized Slurmctld and Slurmdbd in production?
We are in the process of planning a migration from a single host running slurmctld, slurmdbd, and MySQL (and other HPC services) to separate OpenStack VMs. Our site averages less than 1000’s running / pending jobs at any given time. Like many HPC sites, our jobs are a mix of long running, large arrays, very short…
I ran across this GitHub project, “Slurm Docker Cluster” (https://github.com/giovtorres/slurm-docker-cluster), and it got me thinking that this method might be great for simpler upgrades, ease of reproducing the cluster in development, etc…
How about it, anyone running containerized Slurm server processes in production?