[slurm-users] what is the elegant way to drain node from epilog with self-defined reason?

Tue May 3 13:34:59 UTC 2022

I've done something similar by having the epilog touch a file and then
having the node health check (LBNL NHC) act on that file's presence or
contents later to do the heavy lifting. There is a delay during which the
reason is "Epilog error" before the health check corrects it, but if that
is tolerable, this makes for a fast epilog script.
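
A minimal sketch of that idea in bash, not my exact scripts. The flag file
path and both function names are made up for illustration, and how the check
gets hooked into NHC (or whatever health check you run) is left out. The
epilog only writes a one-line reason to a node-local file and then exits
non-zero, so it never calls a Slurm command itself. The health check reads
the file later and reports the reason.

```shell
#!/bin/bash
# FLAG_FILE is a hypothetical node-local path; adjust to your site.
FLAG_FILE="${FLAG_FILE:-/var/run/epilog_drain_reason}"

# Epilog side: store a one-line drain reason.
# Writes to a temp file first and renames it, so a reader never sees
# a half-written reason.
record_drain_reason() {
    local reason="$1"
    local tmp="${FLAG_FILE}.$$"
    printf '%s\n' "$reason" > "$tmp" && mv -f "$tmp" "$FLAG_FILE"
}

# Health-check side: if a reason was recorded, print it and return 1.
# Returns 0 when the flag file is missing or empty.
check_drain_flag() {
    [ -s "$FLAG_FILE" ] || return 0
    head -n 1 "$FLAG_FILE"
    return 1
}

# Example use inside an epilog (my_gpu_check is a placeholder for your test):
#   my_gpu_check || { record_drain_reason "gpu check failed"; exit 1; }
```

The health check can then pass the printed line along as the drain reason,
and the repair program can remove the flag file once the node is fixed.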

griznog

On Tue, May 3, 2022 at 2:49 AM <taleintervenor at sjtu.edu.cn> wrote:

> Hi, all:
>
> We need to detect certain problems when a job ends, so we wrote a
> detection script in the Slurm epilog that should drain the node if a
> check fails.
>
> I know that exiting the epilog with a non-zero code makes Slurm drain the
> node automatically. But then every drain reason is marked as *“Epilog
> error”*, and our auto-repair program has trouble determining how to
> repair the node.
>
> Another way is to call *scontrol* directly from the epilog to drain the
> node, but the official documentation at
> https://slurm.schedmd.com/prolog_epilog.html says:
>
> *Prolog and Epilog scripts should be designed to be as short as possible
> and should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc).
> … Slurm commands in these scripts can potentially lead to performance
> issues and should not be used.*
>
> So what is the best way to drain a node from the epilog with a
> self-defined reason, or to tell Slurm to add a more verbose message than
> the *“Epilog error”* reason?
>