[slurm-users] Slurm cloud scheduling/power saving

Thu Apr 1 16:57:48 UTC 2021

Run 'sinfo -R' to see if any of your nodes are out of the mix.

If so, resume them and see if things work.
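
As a rough sketch of those two steps (the node name cloud-0 is a made-up placeholder, and resume_cmd is just a hypothetical helper that prints the command so you can review it before running it as root or SlurmUser):

```shell
# Print the scontrol command that returns a node to service.
# The node name passed in is whatever 'sinfo -R' reported.
resume_cmd() {
    printf 'scontrol update NodeName=%s State=RESUME\n' "$1"
}

# List nodes that are down, drained or failing, with the recorded reason.
# Skipped when the Slurm client tools are not on this host.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -R
fi

# Show the command for one (hypothetical) node; run it by hand once checked.
resume_cmd "cloud-0"
```

If 'sinfo -R' lists nothing, the nodes are probably not down or drained, and the cause is likely elsewhere.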

Brian Andrus

On 4/1/2021 1:53 AM, Steve Brasier wrote:
> Hi all, anyone have suggestions for debugging cloud nodes not 
> resuming? I've had this working before but I'm now using "configless" 
> mode so wondering if that's an issue.
>
> If I log in as SlurmUser and run the ResumeProgram manually, the 
> specified node(s) boot, and if I log into them `sinfo` works although 
> it only shows the "static" nodes, not the newly booted "cloud" nodes. 
> So that at least shows the program works, the image works, and new 
> nodes can contact the slurmctld.
>
> However if I run a job which requires cloud nodes it immediately goes 
> Pending showing "Nodes required for job are DOWN, DRAINED or reserved 
> for jobs in higher priority partitions". Looking at SlurmctldLogFile 
> with SlurmdDebug=debug5 I don't see any attempt to boot the nodes at 
> all :-(.
>
> I can post slurm.conf if anyone wants to look but I think the 
> important parameters are probably that I've got:
>
> SlurmctldParameters=enable_configless,idle_on_node_suspend,cloud_dns,power_save_interval=10,power_save_min_interval=0
>
> Does that look right?
>
> thanks for any suggestions!
>
> Steve
>
> http://stackhpc.com/
> Please note I work Tuesday to Friday.