[slurm-users] user-provided epilog does not always run

Alex Chekholko alex at calicolabs.com
Thu Apr 25 21:34:26 UTC 2019

Hi all,

My expectation is that the epilog script runs no matter how the job ends
(failure, cancellation, timeout, etc.). Is that true, or are there corner
cases?  I hope I understand the intended behavior correctly.

My OS is Ubuntu 18.04.2 LTS and my SLURM is 18.08.7 built from source.

The end goal is to have users clean up their own temp files in their own
epilog script, to make the job more portable between clusters.  So in
their case the epilog contains some "rm -rf" commands, which can be slow.
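
For illustration, here is a minimal sketch of the kind of cleanup epilog I
have in mind. It is not the user's actual script: the per-job scratch
layout (<base>/<jobid>), the MY_SCRATCH_BASE variable and the function name
cleanup_job_tmp are all hypothetical, and it assumes SLURM_JOB_ID is
exported into the epilog's environment.

```shell
#!/bin/bash
# Sketch of a user epilog that removes a per-job scratch directory.
# ASSUMPTIONS (not from the real setup): scratch lives in <base>/<jobid>,
# MY_SCRATCH_BASE is a made-up override, SLURM_JOB_ID is set by Slurm.

cleanup_job_tmp() {
    local base="$1"
    local job_id="$2"

    # Only accept a plain numeric job ID, so an empty or odd value can
    # never turn the rm below into something wider than intended.
    case "$job_id" in
        ''|*[!0-9]*)
            echo "refusing to clean: bad job id '$job_id'" >&2
            return 1
            ;;
    esac

    local dir="${base}/${job_id}"
    # Nothing to do if the job never created its scratch directory.
    [ -d "$dir" ] || return 0
    rm -rf -- "$dir"
}

# Run the cleanup only when a job ID is present in the environment.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    cleanup_job_tmp "${MY_SCRATCH_BASE:-/tmp/${USER:-unknown}}" "$SLURM_JOB_ID"
fi
```

Validating the job ID before the rm is deliberate: an unset variable would
otherwise make the path collapse to the base directory itself.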

Here is my simple user epilog script:

$ cat example-user-epilog.sh

# table is https://slurm.schedmd.com/prolog_epilog.html
echo "inside my own epilog"
printenv | grep SLURM

Here is my simple user job script:

$ cat example-user-sbatch.sh

echo "starting my job"

#first and only task inside my job
srun --epilog=/home/alex/example-user-epilog.sh sleep 600

What would be the cases where this epilog script would not run?

I tried four distinct cases: letting the job complete normally, running it
with a short time limit so it gets canceled by timeout, canceling it with
scancel, and killing my sleep command on the node so the job fails.  All of
them seemed to work OK, as evidenced by the list of SLURM env vars in my
output file.

However, if I now amend the epilog script to include a sleep command, it
seems to get killed half-way through.

$ cat example-user-epilog.sh

# table is https://slurm.schedmd.com/prolog_epilog.html
echo "inside my own epilog"
sleep 10
printenv | grep SLURM

I thought maybe there was some epilog timeout, but PrologEpilogTimeout is
set to the default (65534) in my case.

$ scontrol show config | grep -i time
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2019-04-25T09:03:20
EioTimeout              = 60
EpilogMsgTime           = 2000 usec
GetEnvTimeout           = 2 sec
GroupUpdateTime         = 600 sec
KeepAliveTime           = SYSTEM_DEFAULT
LogTimeFormat           = iso8601_ms
MessageTimeout          = 10 sec
OverTimeLimit           = 0 min
PrologEpilogTimeout     = 65534
ResumeTimeout           = 60 sec
SchedulerTimeSlice      = 30 sec
SlurmctldTimeout        = 300 sec
SlurmdTimeout           = 300 sec
SuspendTime             = NONE
SuspendTimeout          = 30 sec
TCPTimeout              = 2 sec
UnkillableStepTimeout   = 60 sec
WaitTime                = 0 sec

Looking through the slurmd logs on the compute nodes, I sometimes see a
message like
[2019-04-25T05:54:52.241] epilog for job 17280 ran for 12 seconds
so I'm guessing that gets reported for epilogs which run "long".

But in this case the user epilog seems to get killed or not run at all.

How can I troubleshoot further?
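
One thing I could try is an instrumented epilog along these lines. This is
only a sketch: the log path, the EPILOG_LOG variable and the function names
are hypothetical, and I am assuming SLURM_JOB_ID is set in the epilog's
environment. The idea is to write a timestamped line at the start and the
end and to trap the catchable signals, so the log shows whether the script
started, whether it finished, and whether it was signaled in between. A
"start" line with neither an "end" nor a "caught" line would be consistent
with SIGKILL, which cannot be trapped.

```shell
#!/bin/bash
# Debugging sketch: record how far the epilog gets and which signal ends it.
# ASSUMPTION: the log path is hypothetical; any user-writable file works.
EPILOG_LOG="${EPILOG_LOG:-$HOME/epilog-debug.log}"

log_line() {
    # Append one timestamped line, tagged with the shell PID and job ID.
    printf '%s pid=%s job=%s %s\n' \
        "$(date '+%Y-%m-%dT%H:%M:%S')" "$$" "${SLURM_JOB_ID:-unknown}" "$*" \
        >> "$EPILOG_LOG"
}

run_epilog() {
    local delay="${1:-10}"

    # Log catchable signals before exiting with the conventional 128+N code.
    trap 'log_line "caught SIGTERM"; exit 143' TERM
    trap 'log_line "caught SIGHUP"; exit 129' HUP
    trap 'log_line "caught SIGINT"; exit 130' INT

    log_line "epilog start"
    # Sleep in the background and wait on it, so that a trap fires as soon
    # as the signal arrives instead of after the sleep has finished.
    sleep "$delay" &
    wait $!
    log_line "epilog end"
}

# Run only when invoked for a job (assumes Slurm exports SLURM_JOB_ID).
if [ -n "${SLURM_JOB_ID:-}" ]; then
    run_epilog 10
fi
```

Writing to a file rather than stdout means the lines survive even if the
job's output stream is already closed by the time the epilog runs.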

In case this is some kind of cgroups thing, I do have
in my slurmd systemd unit file.
