[slurm-users] what is the elegant way to drain node from epilog with self-defined reason?

Tue May 3 21:22:25 UTC 2022

On Tuesday, 03 May 2022, at 15:46:38 (+0800),
taleintervenor at sjtu.edu.cn wrote:

> We need to detect certain problems when a job ends, so we wrote a
> detection script in the Slurm epilog that should drain the node if a
> check fails.
>
> I know that exiting the epilog with a non-zero code makes Slurm drain the
> node automatically. But then the drain reason is always recorded as "Epilog
> error", and our auto-repair program has trouble determining how to repair
> the node.
>
> Another way is to call scontrol directly from the epilog to drain the node,
> but the official doc at https://slurm.schedmd.com/prolog_epilog.html says:
>
> Prolog and Epilog scripts should be designed to be as short as possible and
> should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc).
> Slurm commands in these scripts can potentially lead to performance issues
> and should not be used.
>
> So what is the best way to drain a node from the epilog with a
> self-defined reason, or to tell Slurm to record a more verbose message
> than the "Epilog error" reason?

Invoking `scontrol` from a prolog/epilog script to simply alter nodes'
state and/or reason fields is totally fine.  Many sites (including
ours) use LBNL NHC for all or part of their epilogs' post-job "sanity
checking" of nodes, and -- knock on renewable bamboo -- there have
been no concurrency issues (loops, deadlocks, etc.) reported to either
project to date. :-)

If it helps, I had similar concerns about invoking the `squeue`
command from an NHC run in order to gather job data.  The Man Himself
(Moe Jette, original creator of Slurm and co-founder of SchedMD) was
kind enough to weigh in on the issue (literally, the Issue:
https://github.com/mej/nhc/issues/15), saying in part,

     "I do not believe that you could create a deadlock situation from
      NHC (if you did, I would consider that a Slurm bug)."
                -- https://github.com/mej/nhc/issues/15#issuecomment-217174363

That's not to say you should go hog-wild and fill your epilog script
with all the `s`-commands you can think of.... ;-)  But you can at
least be reasonably confident that draining/offlining a node from an
epilog script will not cause your cluster to implode!
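
For illustration only, here is a minimal sketch of draining a node with a
custom reason, assuming the epilog is written in Python (or calls a Python
helper).  The function names `build_drain_command` and `drain_node` are
hypothetical, as is the 30-second timeout; the sketch also assumes the
SLURMD_NODENAME environment variable is set in the epilog environment and
that `scontrol` is on the PATH.

```python
import os
import subprocess


def build_drain_command(node, reason):
    """Return the argv list that drains `node` with a self-defined reason."""
    # An argv list (no shell) keeps a reason containing spaces in one argument.
    return [
        "scontrol", "update",
        f"NodeName={node}",
        "State=DRAIN",
        f"Reason={reason}",
    ]


def drain_node(reason, node=None, runner=subprocess.run):
    """Drain this node; `runner` is injectable so the logic can be dry-run."""
    # Assumption: SLURMD_NODENAME is provided in the epilog environment.
    node = node or os.environ["SLURMD_NODENAME"]
    cmd = build_drain_command(node, reason)
    # Hypothetical 30 s timeout so a stuck command cannot hang the epilog.
    runner(cmd, check=True, timeout=30)
    return cmd
```

A failing check would then call something like
`drain_node("GPU ECC check failed")`, with the reason string chosen so
that your auto-repair program can parse it.  Whether the script should
still exit 0 afterwards (so that Slurm does not also apply its own
"Epilog error" drain) is something to verify on your own cluster.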

Michael

--
Michael E. Jennings <mej at lanl.gov> - [PGPH: he/him/his/Mr]  --  hpc.lanl.gov
HPC Systems Engineer   --   Platforms Team   --  HPC Systems Group (HPC-SYS)
Strategic Computing Complex, Bldg. 03-2327, Rm. 2341    W: +1 (505) 606-0605
Los Alamos National Laboratory,  P.O. Box 1663,  Los Alamos, NM   87545-0001