[slurm-users] How do you orchestrate SLURM operations, what tools do you use?

Michael Jennings mej at lanl.gov
Wed Aug 15 12:57:06 MDT 2018


On Wednesday, 15 August 2018, at 10:01:19 (-0400),
Paul Edmon wrote:

> On 08/14/2018 05:16 AM, Pablo Llopis wrote:
> >
> >Integration with a possible built-in healthcheck is also something
> >to consider, as the orchestration logic would need to take care of
> >disabling the healthcheck functionality that automatically
> >restores/resumes drained nodes to avoid conflicts.
>
> So we use NHC for our automatic node closer.  For reopening we have
> a series of scripts that we use, but they are all ad hoc and not
> formalized.  Same with closing off subsets of nodes: we just have a
> bunch of bash scripts that we have rolled to do that.

Every site is different, and so your needs may vary.  But for those
sites that use NHC, I just wanted to note how it handles the conflict
avoidance issue that was mentioned.

If a node comes back clean (all NHC tests passed), and if MARK_OFFLINE
is set to 1, then NHC will kick off a helper script called
"node-mark-online" to bring the node back into service.  The version
of node-mark-online that comes with NHC will *ONLY* return nodes to
service if the SLURM "Reason" field for that node (as shown by, e.g.,
"sinfo -Rl") starts with "NHC:" (meaning that the message was put
there by NHC itself).  Any nodes that are in a DRAIN or DOWN state
that were *not* drained by NHC itself will be left alone.
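
To make that rule concrete, here is a minimal Python sketch of the
decision.  It is NOT the actual node-mark-online code (that helper ships
with NHC); the function names and the "node|state|reason" input format
are assumptions for illustration, chosen to match what something like
"sinfo -h -N -o '%N|%T|%E'" would print.

```python
NHC_PREFIX = "NHC:"  # reason prefix that marks a node as drained by NHC


def may_return_to_service(state: str, reason: str) -> bool:
    """Return True only for DRAIN/DOWN nodes whose Reason starts with 'NHC:'.

    Illustrative sketch of the rule described above, not NHC's own code.
    """
    s = state.lower()
    is_offline = "drain" in s or "down" in s
    return is_offline and reason.startswith(NHC_PREFIX)


def nodes_to_resume(lines):
    """Pick node names from 'node|state|reason' lines (assumed format)."""
    selected = []
    for line in lines:
        # Split on the first two separators only, so a '|' inside the
        # reason text does not break parsing.
        node, state, reason = line.rstrip("\n").split("|", 2)
        if may_return_to_service(state, reason) and node not in selected:
            selected.append(node)
    return selected
```

With this rule, a node drained with a reason such as "Ops: bad DIMM"
is never selected, no matter how many health checks it passes.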

This way, if your Ops staff need to take a node out of service for
some reason, NHC won't try to put it back in service just because it
passes all tests.
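
For the Ops side of this, a small Python sketch of how a wrapper might
build the drain command.  The "scontrol update NodeName=... State=DRAIN
Reason=..." form is standard SLURM; the function name and the guard
against an "NHC:" prefix are my own illustrative assumptions, there only
to keep a manual drain from looking like one that NHC made.

```python
def drain_command(node: str, reason: str) -> list:
    """Build an scontrol argument list that drains `node` with `reason`.

    Hypothetical helper for illustration.  It refuses reasons that start
    with 'NHC:' so that NHC will leave the node alone afterward.
    """
    if not reason.strip():
        raise ValueError("SLURM requires a reason when draining a node")
    if reason.startswith("NHC:"):
        raise ValueError("manual drain reasons must not start with 'NHC:'")
    return [
        "scontrol", "update",
        f"NodeName={node}",
        "State=DRAIN",
        f"Reason={reason}",
    ]

# Example use (requires a working SLURM installation and privileges):
#   import subprocess
#   subprocess.run(drain_command("cn001", "Ops: kernel patch"), check=True)
```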

So you can safely use NHC to orchestrate operations across your
compute nodes -- relying on it to both drain unpatched nodes initially
and then restore them to service afterward -- OR you can use some
other orchestration tool like Ansible knowing that NHC will not
interfere in its activities.

As far as tool recommendations go, apart from NHC, we use a
LANL-created utility called "pexec" which can leverage netgroups,
SLURM node states (like "allup" or "alldown"), node ranges, and so on.
It's available at https://github.com/hpc/pexec
We also use pdsh and are planning to investigate clush and other
options in the near future.

HTH!
Michael

-- 
Michael E. Jennings <mej at lanl.gov>
HPC Systems Team, Los Alamos National Laboratory
Bldg. 03-2327, Rm. 2341     W: +1 (505) 606-0605


