[slurm-users] sched

Alex Chekholko alex at calicolabs.com
Thu Dec 12 17:06:56 UTC 2019


Hey Steve,

I think it doesn't just "power down" the nodes but actually deletes the
instances. So when you need a node again, it creates a new instance,
provisions it, and then updates the Slurm cluster config...

That's how I understand it, but I haven't tried running it myself.
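
For context, the power-saving hooks that drive all of this are just a few
slurm.conf settings; roughly something like the following (the paths and
values here are illustrative, not necessarily what slurm-gcp installs):

SuspendProgram=/opt/slurm/scripts/suspend.py   # tears down the GCE instance
ResumeProgram=/opt/slurm/scripts/resume.py     # creates and provisions a new one
SuspendTime=300         # seconds a node must sit idle before it is suspended
SuspendTimeout=300      # time allowed for SuspendProgram to power a node down
ResumeTimeout=300       # time slurmctld waits for the resumed node's slurmd
SuspendExcNodes=login,controller   # illustrative: nodes never powered down

The contract is just that suspend.py and resume.py exit successfully once the
node has been torn down or brought up; slurmctld does the state bookkeeping.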

Regards,
Alex

On Thu, Dec 12, 2019 at 1:20 AM Steve Brasier <steveb at stackhpc.com> wrote:

> Hi, I'm hoping someone can shed some light on the SchedMD-provided example
> here https://github.com/SchedMD/slurm-gcp for an autoscaling cluster on
> Google Cloud Platform (GCP).
>
> I understand that Slurm autoscaling uses the power saving interface to
> create/remove nodes, and the example suspend.py and resume.py scripts in the
> repo seem pretty clear and in line with the Slurm docs. However, I don't
> understand why the additional slurm-gcp-sync.py script is required. It
> seems to compare the states of nodes as seen by Google Compute and by Slurm,
> and then either start or shut down instances on the GCP side and mark nodes
> as RESUME or DOWN on the Slurm side. I don't see why this is necessary,
> though: my understanding from the Slurm docs is that e.g. the suspend script
> simply has to "power down" the nodes, and slurmctld will then mark them as
> being in power saving mode - marking nodes DOWN would seem to prevent jobs
> being scheduled on them, which isn't what we want. Similarly, I would have
> thought the resume.py script could mark nodes as in the RESUME state itself
> (once it has checked that the node is up and slurmd is running, etc.).
>
> thanks for any help
> Steve
>
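
For anyone else reading this thread later: the Slurm-side half of the
reconciliation Steve describes would look roughly like the sketch below. This
is not the actual slurm-gcp-sync.py, just an illustration of comparing node
state between Slurm (sinfo) and Google Compute (gcloud) and then updating
Slurm with scontrol; the node-name prefix is an assumption.

#!/usr/bin/env python3
# Simplified sketch only -- NOT the real slurm-gcp-sync.py. It illustrates the
# Slurm-side half of the reconciliation (the GCP-side half, starting/stopping
# instances, is omitted). The node-name prefix is an assumption.
import subprocess

NODE_PREFIX = "compute"  # assumed prefix for the cloud compute nodes


def slurm_node_states():
    """Map node name -> compact state ("idle~" means powered down)."""
    out = subprocess.run(["sinfo", "-h", "-N", "-o", "%N %t"],
                         capture_output=True, text=True, check=True).stdout
    return dict(line.split() for line in out.splitlines())


def gce_instance_states():
    """Map instance name -> status (e.g. RUNNING, TERMINATED)."""
    out = subprocess.run(["gcloud", "compute", "instances", "list",
                          "--format=value(name,status)"],
                         capture_output=True, text=True, check=True).stdout
    return dict(line.split() for line in out.splitlines())


def main():
    gce = gce_instance_states()
    for node, state in slurm_node_states().items():
        if not node.startswith(NODE_PREFIX):
            continue
        running = gce.get(node) == "RUNNING"
        if running and state.endswith("~"):
            # Instance exists but Slurm still thinks the node is powered down.
            subprocess.run(["scontrol", "update", f"nodename={node}",
                            "state=RESUME"], check=True)
        elif not running and not state.endswith("~"):
            # Slurm thinks the node is up but there is no running instance.
            subprocess.run(["scontrol", "update", f"nodename={node}",
                            "state=DOWN", "reason=gcp_instance_missing"],
                           check=True)


if __name__ == "__main__":
    main()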