[slurm-users] How do you orchestrate SLURM operations, what tools do you use?

Paul Edmon pedmon at cfa.harvard.edu
Wed Aug 15 08:01:19 MDT 2018


We use NHC (Node Health Check) to close nodes automatically.  For 
reopening we have a series of scripts, but they are all ad hoc and not 
formalized.  The same goes for closing off subsets of nodes: we just 
have a bunch of bash scripts that we rolled ourselves.

-Paul Edmon-


On 08/14/2018 05:16 AM, Pablo Llopis wrote:
> Dear SLURM users,
>
> I was wondering what kind of tools the community is using for 
> orchestrating SLURM operations.
>
> For instance, say you want to execute an operation in the cluster 
> which requires draining the nodes first. What kind of tools are you 
> using to automate the state machine that would go through the 
> draining, applying the operation, then finally undraining the nodes? 
> (maybe even more convoluted procedures)
>
> While it is possible to do these operations in a semi-manual fashion 
> by using a combination of automated tasks (scontrol and some 
> ansible/mco/bolt/whatever), this will usually result in manually 
> transitioning between drain -> apply operation -> undrain.  The 
> disadvantage of this is the overhead of keeping track of the state of 
> draining nodes (some of our jobs can run for many weeks). In addition, 
> if a set of nodes are drained at midnight or during the weekend, no 
> jobs will be able to run until an operator triggers the next step, 
> which means wasting precious computing resources with idle hours :)
> This is where an orchestration tool would come in handy.
>
> For doing reboots, scontrol reboot does almost all of this already, 
> but there may be other, more complex operations to be done in a 
> similar fashion.
>
> Integration with a possible built-in healthcheck is also something to 
> consider, as the orchestration logic would need to take care of 
> disabling the healthcheck functionality that automatically 
> restores/resumes drained nodes to avoid conflicts.
>
> I would like to learn how the community deals with these kinds of 
> operations, whether you are using Open Source tools, or you developed 
> your own orchestration framework. Maybe you developed your own 
> SLURM-specific tools to deal with this?
>
> Thanks!
> Pablo
>
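
For illustration only, the drain -> apply operation -> undrain cycle 
described in the quoted message could be sketched roughly as below.  This 
is not any site's actual tooling: the function names, the polling 
interval, and the default reason string are all made up for the example.  
It assumes scontrol and sinfo are on the PATH, that the caller is allowed 
to update node state, and that any healthcheck which auto-resumes drained 
nodes has been disabled separately.

```python
import subprocess
import time


def drain_cmd(nodes, reason):
    # scontrol requires a Reason when setting a node to DRAIN.
    return ["scontrol", "update", f"NodeName={nodes}",
            "State=DRAIN", f"Reason={reason}"]


def resume_cmd(nodes):
    return ["scontrol", "update", f"NodeName={nodes}", "State=RESUME"]


def state_cmd(nodes):
    # One "<node> <state>" line per node (and partition), without a header.
    return ["sinfo", "-h", "-N", "-n", nodes, "-o", "%N %T"]


def all_drained(sinfo_output):
    # "draining" means jobs are still running; "drained" means the node is
    # empty. startswith() tolerates state suffixes such as "drained*".
    states = [line.split()[1]
              for line in sinfo_output.splitlines() if line.strip()]
    return bool(states) and all(s.startswith("drained") for s in states)


def run_cmd(cmd):
    # Run a command, raise on a non-zero exit status, return its stdout.
    return subprocess.run(cmd, check=True, capture_output=True,
                          text=True).stdout


def drain_apply_resume(nodes, operation, reason="maintenance",
                       poll_seconds=60, runner=run_cmd, sleep=time.sleep):
    """Drain `nodes`, wait until they are empty, apply `operation`, resume.

    `operation` is a hypothetical callable supplied by the caller (for
    example a wrapper around an ansible run). If it raises, the nodes are
    deliberately left drained for an operator to inspect.
    """
    runner(drain_cmd(nodes, reason))
    while not all_drained(runner(state_cmd(nodes))):
        sleep(poll_seconds)
    operation(nodes)
    runner(resume_cmd(nodes))
```

Called as drain_apply_resume("node[01-04]", my_operation), this waits 
unattended for long-running jobs to finish, so nodes that empty out at 
midnight or over the weekend are put back into service without an 
operator triggering the next step.  The runner and sleep parameters exist 
only so the logic can be exercised without a live cluster.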



