[slurm-users] what is the elegant way to drain node from epilog with self-defined reason?

Paul Edmon pedmon at cfa.harvard.edu
Tue May 3 13:24:20 UTC 2022


We've invoked scontrol in our epilog script for years to close off nodes 
without any issue.  What the docs are really warning against is gratuitous 
use of those commands.  If you keep those calls well circumscribed 
(i.e. only invoked when you actually have to close a node) and only use 
them when you have no other workaround, then you should be fine.
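
A minimal sketch of that pattern, assuming a Python epilog: the health 
check, its path, and all function names below are hypothetical and only 
illustrate the idea. scontrol is invoked at most once, and only when a 
check has actually failed, so a healthy node never runs a Slurm command 
from the epilog.

```python
#!/usr/bin/env python3
"""Epilog sketch: drain this node with a specific reason when a check fails."""
import os
import subprocess


def scratch_is_writable(path="/tmp"):
    """Hypothetical health check: return a failure reason, or None if healthy."""
    return None if os.access(path, os.W_OK) else f"epilog: {path} not writable"


def build_drain_command(node, reason):
    """Build the scontrol invocation that drains `node` with `reason`."""
    return ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]


def run_epilog(checks, node, runner=subprocess.run):
    """Run checks in order; call scontrol once, and only on the first failure."""
    for check in checks:
        reason = check()
        if reason is not None:
            runner(build_drain_command(node, reason), check=False)
            return reason
    return None


def main():
    # slurmd sets SLURMD_NODENAME in the epilog environment.
    node = os.environ.get("SLURMD_NODENAME")
    if node:
        run_epilog([scratch_is_writable], node)
    # Returning normally gives exit code 0, so Slurm does not additionally
    # drain the node with the generic "Epilog error" reason.


if __name__ == "__main__":
    main()
```

The runner argument exists only so the logic can be exercised without a 
real scontrol binary; in production the default subprocess.run is used.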

-Paul Edmon-

On 5/3/2022 3:46 AM, taleintervenor at sjtu.edu.cn wrote:
>
> Hi, all:
>
> We need to detect certain problems when a job ends, so we wrote a 
> detection script in the Slurm epilog, which should drain the node if a 
> check fails.
>
> I know that exiting the epilog with a non-zero code makes Slurm drain 
> the node automatically. But in that case the drain reason is always 
> recorded as “Epilog error”, so our auto-repair program cannot 
> determine how to repair the node.
>
> Another way is to call scontrol directly from the epilog to drain the 
> node, but the official documentation at 
> https://slurm.schedmd.com/prolog_epilog.html says:
>
> /Prolog and Epilog scripts should be designed to be as short as 
> possible and should not call Slurm commands (e.g. squeue, scontrol, 
> sacctmgr, etc). … Slurm commands in these scripts can potentially lead 
> to performance issues and should not be used./
>
> So what is the best way to drain a node from the epilog with a 
> self-defined reason, or to make Slurm record a more detailed reason 
> than “Epilog error”?
>