[slurm-users] How do you orchestrate SLURM operations, what tools do you use?
Paul Edmon
pedmon at cfa.harvard.edu
Wed Aug 15 08:01:19 MDT 2018
We use NHC (Node Health Check) to close nodes automatically. For
reopening, we have a series of scripts, but they are all ad hoc and not
formalized. The same goes for closing off subsets of nodes: we just have
a bunch of bash scripts that we rolled ourselves.
-Paul Edmon-
On 08/14/2018 05:16 AM, Pablo Llopis wrote:
> Dear SLURM users,
>
> I was wondering what kind of tools the community is using for
> orchestrating SLURM operations.
>
> For instance, say you want to execute an operation in the cluster
> which requires draining the nodes first. What kind of tools are you
> using to automate the state machine that would go through the
> draining, applying the operation, then finally undraining the nodes?
> (maybe even more convoluted procedures)
>
> While it is possible to do these operations in a semi-manual fashion
> by using a combination of automated tasks (scontrol and some
> ansible/mco/bolt/whatever), this will usually result in manually
> transitioning between drain -> apply operation -> undrain. The
> disadvantage of this is the overhead of keeping track of the state of
> draining nodes (some of our jobs can run for many weeks). In addition,
> if a set of nodes are drained at midnight or during the weekend, no
> jobs will be able to run until an operator triggers the next step,
> which means wasting precious computing resources with idle hours :)
> This is where an orchestration tool would come in handy.
>
> For doing reboots, scontrol reboot does almost all of this already,
> but there may be other, more complex operations to be done in a
> similar fashion.
>
> Integration with a possible built-in healthcheck is also something to
> consider, as the orchestration logic would need to take care of
> disabling the healthcheck functionality that automatically
> restores/resumes drained nodes to avoid conflicts.
>
> I would like to learn how the community deals with these kinds of
> operations, whether you are using Open Source tools, or you developed
> your own orchestration framework. Maybe you developed your own
> SLURM-specific tools to deal with this?
>
> Thanks!
> Pablo
>
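For illustration, the drain -> apply operation -> undrain cycle described in the quoted message could be sketched as a small state machine. This is only a sketch under assumptions, not a tool anyone in this thread uses: the names Step, advance, next_step and the Reason text "orchestrator: maintenance" are hypothetical, while the scontrol update and sinfo invocations use the standard NodeName/State/Reason fields and the %N/%T format specifiers. Each call to advance() performs one step and returns the next one, so it could be triggered periodically (for example from cron) with the current step stored between runs, instead of an operator tracking long drains by hand.

```python
import subprocess
from enum import Enum


class Step(Enum):
    DRAIN = 1
    WAIT = 2
    APPLY = 3
    RESUME = 4
    DONE = 5


def drain_cmd(nodes, reason):
    # Standard scontrol syntax; the reason text is a hypothetical tag.
    return ["scontrol", "update", f"NodeName={nodes}",
            "State=DRAIN", f"Reason={reason}"]


def resume_cmd(nodes):
    return ["scontrol", "update", f"NodeName={nodes}", "State=RESUME"]


def query_states(nodes):
    """Return {node: state} using sinfo's long state names (%T)."""
    out = subprocess.run(
        ["sinfo", "-h", "-N", "-n", nodes, "-o", "%N %T"],
        check=True, capture_output=True, text=True,
    ).stdout
    states = {}
    for line in out.splitlines():
        parts = line.split()
        if len(parts) == 2:
            states[parts[0]] = parts[1]
    return states


def is_drained(state):
    # sinfo may append flag characters such as "*" (not responding).
    return "".join(c for c in state if c.isalpha()).lower() == "drained"


def next_step(step, states):
    """Pure transition function: decide what comes after `step`."""
    if step is Step.DRAIN:
        return Step.WAIT
    if step is Step.WAIT:
        # "draining" means jobs are still running; stay in WAIT until
        # every node reports "drained".
        done = bool(states) and all(is_drained(s) for s in states.values())
        return Step.APPLY if done else Step.WAIT
    if step is Step.APPLY:
        return Step.RESUME
    return Step.DONE


def advance(step, nodes, operation,
            run=subprocess.check_call, get_states=query_states):
    """Perform the action for `step` once and return the following step."""
    states = {}
    if step is Step.DRAIN:
        run(drain_cmd(nodes, "orchestrator: maintenance"))
    elif step is Step.WAIT:
        states = get_states(nodes)
    elif step is Step.APPLY:
        operation(nodes)  # caller-supplied maintenance action
    elif step is Step.RESUME:
        run(resume_cmd(nodes))
    return next_step(step, states)
```

The run and get_states parameters exist so the logic can be exercised without a cluster. The distinctive Reason string is also an assumption: the idea is that a health check which automatically resumes nodes could be made to skip nodes carrying it, which is one way to avoid the conflict mentioned above.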