[slurm-users] What is an elegant way to drain a node from the epilog with a self-defined reason?

taleintervenor at sjtu.edu.cn taleintervenor at sjtu.edu.cn
Tue May 3 07:46:38 UTC 2022


Hi all,

We need to detect certain problems at the moment a job ends, so we added a
detection script to the Slurm epilog. It should drain the node if a check
does not pass.

I know that exiting the epilog with a non-zero code makes Slurm drain the
node automatically. But in that case the drain reason is always recorded as
"Epilog error", so our auto-repair program has trouble determining how to
repair the node.
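
A minimal sketch of this first approach, for illustration only. The names
node_health_ok, run_epilog_check and CHECK_DIR are hypothetical and are not
part of Slurm; the writable-directory test merely stands in for our real
detection script.

```shell
#!/bin/bash
# Hypothetical health check (not part of Slurm). It succeeds only if
# CHECK_DIR (default /tmp) is writable; replace with real checks.
node_health_ok() {
    [ -w "${CHECK_DIR:-/tmp}" ]
}

# Returns 0 when the node looks healthy and 1 otherwise.
run_epilog_check() {
    if node_health_ok; then
        return 0
    fi
    # A non-zero epilog exit status is what makes Slurm drain the node,
    # but the reason it records is the generic "Epilog error".
    return 1
}
```

In the epilog itself the last line would be something like
"run_epilog_check || exit 1". Nothing in this path lets the script pass a
specific reason to Slurm, which is the limitation described above.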

Another way is to call scontrol directly from the epilog to drain the node,
but the official documentation at https://slurm.schedmd.com/prolog_epilog.html
says:

Prolog and Epilog scripts should be designed to be as short as possible and
should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc).
Slurm commands in these scripts can potentially lead to performance issues
and should not be used.
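
For reference, the scontrol call we have in mind would look roughly like the
sketch below. The function name drain_self and the SCONTROL override are
hypothetical helpers added for illustration; we are assuming that
SLURMD_NODENAME is set in the epilog environment, with a fallback to the
short hostname if it is not. This is exactly the kind of call the
documentation warns about, so it is shown only to make the question concrete.

```shell
#!/bin/bash
# Sketch only: drain the local node with a specific reason via scontrol.
# SCONTROL can be overridden (e.g. SCONTROL=echo) for a dry run without
# a cluster; by default the real scontrol binary is used.
drain_self() {
    drain_reason="$1"
    # Assumption: slurmd exports SLURMD_NODENAME to the epilog.
    drain_node="${SLURMD_NODENAME:-$(hostname -s)}"
    "${SCONTROL:-scontrol}" update NodeName="$drain_node" State=DRAIN Reason="$drain_reason"
}
```

A call such as drain_self "gpu_check_failed" (the reason string is made up)
would give our auto-repair program something specific to act on.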

So what is the best way to drain a node from the epilog with a self-defined
reason, or to make Slurm record a more descriptive reason than "Epilog
error"?
