[slurm-users] sched

Steve Brasier steveb at stackhpc.com
Thu Dec 12 09:20:16 UTC 2019


Hi, I'm hoping someone can shed some light on the SchedMD-provided example
here https://github.com/SchedMD/slurm-gcp for an autoscaling cluster on
Google Cloud Platform (GCP).

I understand that slurm autoscaling uses the power saving interface to
create/remove nodes, and the example suspend.py and resume.py scripts in that
repo seem pretty clear and in line with the slurm docs. However I don't
understand why the additional slurm-gcp-sync.py script is required. It
seems to compare the states of nodes as seen by google compute and slurm
and then on the GCP side either start instances or shut them down, and on
the slurm side mark them as in RESUME or DOWN states. I don't see why this
is necessary though; my understanding from the slurm docs is that e.g. the
suspend script simply has to "power down" the nodes, and slurmctld will
then mark them as in power saving mode - marking nodes down would seem to
prevent jobs being scheduled on them, which isn't what we want. Similarly,
I would have thought the resume.py script could mark nodes as in RESUME
state itself (once it's tested that the node is up and slurmd is running,
etc.).
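
To make my reading of the sync step concrete, here is a rough sketch of the
kind of reconciliation I think it performs. This is not the actual
slurm-gcp-sync.py code: the function name reconcile, the simplified slurm
state labels ("idle", "down", "powered_down") and the decision rules are all
my own assumptions for illustration, and it only returns the actions it would
take rather than calling scontrol or the GCP API.

```python
# Hypothetical sketch of a slurm/GCP state reconciliation pass.
# NOT the real slurm-gcp-sync.py: state labels and rules are assumptions.

def reconcile(slurm_states, gcp_states):
    """Compare per-node state on both sides and return planned actions.

    slurm_states: dict of node name -> simplified slurm state (assumed labels)
    gcp_states:   dict of node name -> GCP instance status, e.g. "RUNNING"
    Returns a list of (side, action, node) tuples.
    """
    actions = []
    for node in sorted(slurm_states):
        slurm_state = slurm_states[node]
        # Treat a missing instance the same as one that is not running.
        instance_up = gcp_states.get(node) == "RUNNING"

        if slurm_state in ("idle", "allocated") and not instance_up:
            # slurm believes the node is usable but no instance is running:
            # start one, and mark the node DOWN until it is back.
            actions.append(("gcp", "start", node))
            actions.append(("slurm", "down", node))
        elif slurm_state == "down" and instance_up:
            # The instance is running again: return the node to service.
            actions.append(("slurm", "resume", node))
        elif slurm_state == "powered_down" and instance_up:
            # slurm has powered the node down but the instance still exists.
            actions.append(("gcp", "stop", node))
    return actions


if __name__ == "__main__":
    slurm = {"n1": "idle", "n2": "down", "n3": "powered_down", "n4": "idle"}
    gcp = {"n1": "TERMINATED", "n2": "RUNNING", "n3": "RUNNING", "n4": "RUNNING"}
    for action in reconcile(slurm, gcp):
        print(action)
```

If that is roughly right, my question is why slurmctld's own power saving
bookkeeping isn't enough to keep the two sides consistent.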

thanks for any help
Steve