[slurm-users] sched

Fri Dec 13 11:13:25 UTC 2019

Thanks Alex - that is mostly how I understand it too. However, my
understanding from the docs (and from the GCP example itself) is that the
cluster isn't reconfigured in the sense of rewriting slurm.conf and
restarting the daemons (i.e. how you might manually resize a cluster).
Instead, slurmctld just marks the nodes as "powered down", even if the
actual instances are released back to the cloud. So I think my query still
stands.
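
To make the query concrete, the reconciliation I describe in my original
message below could be sketched roughly as follows. This is an illustration
only, not the actual slurm-gcp-sync.py code: the function name, the state
names and the decision rules are all assumptions made for the example.

```python
# Illustrative sketch only: NOT the real slurm-gcp-sync.py logic.
# The state names and the rules below are assumptions for this example.

def reconcile(slurm_states, cloud_states):
    """Return (node, action) pairs for nodes where the two views disagree.

    slurm_states: dict of node name -> "IDLE", "ALLOCATED",
                  "POWERED_DOWN" or "DOWN" (hypothetical labels)
    cloud_states: dict of node name -> "RUNNING"; a missing key means
                  no instance exists for that node
    """
    actions = []
    for node in sorted(slurm_states):
        slurm = slurm_states[node]
        cloud = cloud_states.get(node, "ABSENT")
        if slurm == "POWERED_DOWN" and cloud == "RUNNING":
            # Slurm thinks the node is saved, but an instance still exists.
            actions.append((node, "stop_instance"))
        elif slurm in ("IDLE", "ALLOCATED") and cloud == "ABSENT":
            # Slurm expects a live node, but there is no instance behind it.
            actions.append((node, "mark_down"))
        elif slurm == "DOWN" and cloud == "RUNNING":
            # The instance is up, so the node could be returned to service.
            actions.append((node, "mark_resume"))
    return actions


slurm = {"compute1": "IDLE", "compute2": "POWERED_DOWN",
         "compute3": "DOWN", "compute4": "POWERED_DOWN"}
cloud = {"compute2": "RUNNING", "compute3": "RUNNING"}
print(reconcile(slurm, cloud))
```

Nodes where both views already agree (compute4 here) produce no action. My
question is why the mark_down and mark_resume steps would be needed at all if
slurmctld tracks the power saving state itself.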

regards
Steve

On Thu, 12 Dec 2019 at 17:08, Alex Chekholko <alex at calicolabs.com> wrote:

> Hey Steve,
>
> I think it doesn't just "power down" the nodes but deletes the instances.
> So then when you need a new node, it creates one, then provisions the
> config, then updates the slurm cluster config...
>
> That's how I understand it, but I haven't tried running it myself.
>
> Regards,
> Alex
>
> On Thu, Dec 12, 2019 at 1:20 AM Steve Brasier <steveb at stackhpc.com> wrote:
>
>> Hi, I'm hoping someone can shed some light on the SchedMD-provided
>> example here https://github.com/SchedMD/slurm-gcp for an autoscaling
>> cluster on Google Cloud Platform (GCP).
>>
>> I understand that slurm autoscaling uses the power saving interface to
>> create/remove nodes, and the example suspend.py and resume.py scripts
>> seem pretty clear and in line with the slurm docs. However, I don't
>> understand why the additional slurm-gcp-sync.py script is required. It
>> seems to compare the states of nodes as seen by google compute and slurm
>> and then on the GCP side either start instances or shut them down, and on
>> the slurm side mark them as in RESUME or DOWN states. I don't see why this
>> is necessary though; my understanding from the slurm docs is that e.g. the
>> suspend script simply has to "power down" the nodes, and slurmctld will
>> then mark them as in power saving mode - marking nodes down would seem to
>> prevent jobs being scheduled on them, which isn't what we want. Similarly,
>> I would have thought the resume.py script could mark nodes as in RESUME
>> state itself (once it has tested that the node is up, slurmd is running,
>> etc.).
>>
>> thanks for any help
>> Steve
>>
>