[slurm-users] Slurm cloud scheduling/power saving

Steve Brasier steveb at stackhpc.com
Thu Apr 1 08:53:21 UTC 2021


Hi all, anyone have suggestions for debugging cloud nodes not resuming?
I've had this working before but I'm now using "configless" mode so
wondering if that's an issue.

If I log in as SlurmUser and run the ResumeProgram manually, the specified
node(s) boot, and if I log into them `sinfo` works although it only shows
the "static" nodes, not the newly booted "cloud" nodes. So that at least
shows the program works, the image works, and new nodes can contact the
slurmctld.

However if I run a job which requires cloud nodes it immediately goes
Pending showing "Nodes required for job are DOWN, DRAINED or reserved for
jobs in higher priority partitions". Looking at SlurmctldLogFile
with SlurmdDebug=debug5 I don't see any attempt to boot the nodes at all
:-(.

I can post slurm.conf if anyone wants to look but I think the important
parameters are probably that I've got:

SlurmctldParameters=enable_configless,idle_on_node_suspend,cloud_dns,power_save_interval=10,power_save_min_interval=0

Does that look right?
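
For anyone wanting to sanity-check a config like this, here is a rough Python sketch that reads slurm.conf text and reports the settings power saving depends on. It assumes power saving needs ResumeProgram, SuspendProgram and SuspendTime to be set and that cloud nodes are declared with State=CLOUD. The function name, the dict layout, and every value in the example other than the SlurmctldParameters line above (paths, times, node names) are hypothetical. It does not handle line continuations or included files.

```python
# Sketch only: a simplified slurm.conf reader, not Slurm's own parser.
# power_save_report() and its return layout are hypothetical names.

REQUIRED = ("ResumeProgram", "SuspendProgram", "SuspendTime")


def power_save_report(conf_text):
    """Summarise the power-saving related settings found in conf_text."""
    settings = {}
    cloud_nodes = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue
        if line.lower().startswith("nodename="):
            # Node lines carry several space-separated Key=Value pairs.
            pairs = {}
            for item in line.split():
                if "=" in item:
                    key, value = item.split("=", 1)
                    pairs[key.lower()] = value
            if pairs.get("state", "").upper() == "CLOUD":
                cloud_nodes.append(pairs["nodename"])
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()

    params = [p for p in settings.get("slurmctldparameters", "").split(",") if p]
    missing = [name for name in REQUIRED if name.lower() not in settings]
    return {
        "slurmctld_parameters": params,
        "missing": missing,
        "cloud_nodes": cloud_nodes,
    }


# Only the SlurmctldParameters line comes from this post; the rest is made up.
example = """
SlurmctldParameters=enable_configless,idle_on_node_suspend,cloud_dns,power_save_interval=10,power_save_min_interval=0
ResumeProgram=/opt/slurm/resume.sh   # hypothetical path
SuspendTime=300                      # hypothetical value
NodeName=cloud-[0-3] State=CLOUD     # hypothetical node names
"""

report = power_save_report(example)
print(report["missing"])      # ['SuspendProgram']
print(report["cloud_nodes"])  # ['cloud-[0-3]']
```

An empty "missing" list only means the parameters are present, not that their values are sensible. Separately, the verbosity of the slurmctld log is set by SlurmctldDebug rather than SlurmdDebug, so that may be worth checking too.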

Thanks for any suggestions!

Steve

http://stackhpc.com/
Please note I work Tuesday to Friday.