[slurm-users] How do you orchestrate SLURM operations, what tools do you use?

Pablo Llopis pablo.llopis at gmail.com
Tue Aug 14 03:16:07 MDT 2018


Dear SLURM users,

I was wondering what kind of tools the community is using for orchestrating
SLURM operations.

For instance, say you want to execute an operation in the cluster which
requires draining the nodes first. What kind of tools are you using to
automate the state machine that would go through the draining, applying the
operation, then finally undraining the nodes? (maybe even more convoluted
procedures)

While it is possible to do these operations in a semi-manual fashion by
combining automated tasks (scontrol and some ansible/mco/bolt/whatever),
this usually means manually transitioning between drain -> apply
operation -> undrain. The disadvantage is the overhead of keeping track of
the state of draining nodes (some of our jobs can run for many weeks). In
addition, if a set of nodes finishes draining at midnight or during the
weekend, no jobs will be able to run on them until an operator triggers the
next step, which means wasting precious computing resources with idle
hours :)
This is where an orchestration tool would come in handy.
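
To make the drain -> apply operation -> undrain sequence concrete, here is a
minimal Python sketch of what such a tool would have to automate for a single
node. It is only an illustration under some assumptions: the function names
(run_cmd, node_state, drain_apply_resume), the Reason string, and the polling
interval are all made up for this example, and it assumes that
`sinfo -o %T` reports "drained" once the last job has left the node. A real
tool would also need persistence, timeouts, and error handling.

```python
import subprocess
import time
from typing import Callable, Sequence

# A "runner" executes a command line and returns its stdout.
# It is injectable so the logic can be exercised without a real cluster.
Runner = Callable[[Sequence[str]], str]


def run_cmd(argv: Sequence[str]) -> str:
    """Default runner: execute the command and return its stdout."""
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout


def node_state(node: str, run: Runner = run_cmd) -> str:
    """Return the node state reported by sinfo, e.g. 'draining' or 'drained'."""
    out = run(["sinfo", "-h", "-N", "-n", node, "-o", "%T"])
    fields = out.split()
    if not fields:
        return ""
    # A node in several partitions is listed once per partition; use the
    # first entry and drop trailing flag characters such as '*'.
    return "".join(ch for ch in fields[0] if ch.isalpha()).lower()


def drain_apply_resume(
    node: str,
    operation: Callable[[str], None],
    run: Runner = run_cmd,
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Drain a node, wait until it is empty, apply the operation, resume it."""
    # Step 1: stop new jobs from being scheduled on the node.
    run(["scontrol", "update", f"NodeName={node}", "State=DRAIN",
         "Reason=orchestrated maintenance"])

    # Step 2: wait for running jobs to finish (this can take weeks).
    while node_state(node, run) != "drained":
        sleep(poll_seconds)

    # Step 3: apply the operation. If it raises, the node stays drained
    # so that an operator can inspect it.
    operation(node)

    # Step 4: return the node to service.
    run(["scontrol", "update", f"NodeName={node}", "State=RESUME"])
```

Something like drain_apply_resume("node01", my_operation) would then run the
whole sequence unattended, where my_operation is whatever callable applies the
change to the node. The blocking loop is the simplest possible design; the
part I am really asking about is everything around it (many nodes, restarts
of the tool itself, reporting).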

For reboots, scontrol reboot already does almost all of this, but there may
be other, more complex operations to be done in a similar fashion.

Integration with a possible built-in healthcheck is also something to
consider, as the orchestration logic would need to take care of disabling
the healthcheck functionality that automatically restores/resumes drained
nodes, to avoid conflicts.

I would like to learn how the community deals with these kinds of
operations, whether you are using Open Source tools, or you developed your
own orchestration framework. Maybe you developed your own SLURM-specific
tools to deal with this?

Thanks!
Pablo