<div dir="ltr">Thanks Alex - that is mostly how I understand it too. However, my understanding from the docs (and from the GCP example itself) is that the cluster isn't reconfigured in the sense of rewriting slurm.conf and restarting the daemons (i.e. how you might manually resize a cluster); the nodes are just marked by slurmctld as "powered down", even if the actual instances are released back to the cloud. So I think my query still stands.<div><br></div><div>regards</div><div>Steve</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, 12 Dec 2019 at 17:08, Alex Chekholko &lt;<a href="mailto:alex@calicolabs.com">alex@calicolabs.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hey Steve,<br><div><br></div><div>I think it doesn't just "power down" the nodes but deletes the instances. So then when you need a new node, it creates one, then provisions the config, then updates the slurm cluster config...</div><div><br></div><div>That's how I understand it, but I haven't tried running it myself.</div><div><br></div><div>Regards,</div><div>Alex</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Dec 12, 2019 at 1:20 AM Steve Brasier &lt;<a href="mailto:steveb@stackhpc.com" target="_blank">steveb@stackhpc.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hi, I'm hoping someone can shed some light on the SchedMD-provided example here <a href="https://github.com/SchedMD/slurm-gcp" target="_blank">https://github.com/SchedMD/slurm-gcp</a> for an autoscaling cluster on Google Cloud Platform (GCP).</div><div><br></div><div>I understand that slurm autoscaling uses the power saving interface to create/remove nodes, and the example suspend.py and resume.py scripts seem pretty clear and in line with the 
slurm docs. However, I don't understand why the additional slurm-gcp-sync.py script is required. It seems to compare the states of nodes as seen by Google Compute and by slurm, and then either start or shut down instances on the GCP side and mark nodes as in RESUME or DOWN states on the slurm side. I don't see why this is necessary though; my understanding from the slurm docs is that e.g. the suspend script simply has to "power down" the nodes, and slurmctld will then mark them as in power saving mode. Marking nodes down would seem to prevent jobs being scheduled on them, which isn't what we want. Similarly, I would have thought the resume.py script could mark nodes as in RESUME state itself (once it has tested that the node is up and slurmd is running etc).</div><div><br></div><div>thanks for any help</div><div>Steve</div></div>
</blockquote></div>
</blockquote></div>
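<div dir="ltr">For reference, the comparison that the original question attributes to slurm-gcp-sync.py can be sketched roughly as below. This is only an illustration of the idea of reconciling two views of node state, not the code from the slurm-gcp repository: the function name <code>plan_sync</code>, the two input dicts and the action labels are all hypothetical, and which mismatch should trigger which action is an assumption. The slurm-side state changes (RESUME or DOWN) are not modelled here.</div>

```python
def plan_sync(cloud_running, slurm_powered_up):
    """Return (node, action) pairs where the cloud and slurm views disagree.

    Both arguments are hypothetical inputs: dicts mapping node name to bool.
    cloud_running[node] is True if the cloud reports a running instance;
    slurm_powered_up[node] is True if slurm expects the node to be up.
    """
    actions = []
    for node in sorted(slurm_powered_up):
        expected_up = slurm_powered_up[node]
        # A node with no instance record is treated as not running.
        running = cloud_running.get(node, False)
        if expected_up and not running:
            # Assumed policy: slurm wants the node, so start an instance.
            actions.append((node, "start_instance"))
        elif running and not expected_up:
            # Assumed policy: slurm has powered the node down, so stop it.
            actions.append((node, "stop_instance"))
    return actions


if __name__ == "__main__":
    cloud = {"compute1": True, "compute2": False, "compute3": True}
    slurm = {"compute1": True, "compute2": True, "compute3": False}
    for node, action in plan_sync(cloud, slurm):
        print(node, action)
```

<div dir="ltr">Nodes on which both views agree produce no action, so if suspend.py and resume.py always succeeded the plan would be empty; a pass like this only does anything when the two views have drifted apart.</div>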