[slurm-users] Clean Up Scratch After Failed Job

Jason Simms jsimms1 at swarthmore.edu
Tue Oct 10 15:59:37 UTC 2023

Hello all,

Our template scripts for Slurm include a workflow that copies files to a
scratch space before a job runs, copies any output files, etc. back to the
original submit directory when the job completes, and finally cleans up
(deletes) the scratch space before exiting. This works great until a job
fails or is requeued, in which case the scratch space isn't cleaned up.
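
For illustration, the workflow described above might look roughly like the
sketch below. The function names, the scratch root, and the exact directory
layout are assumptions made for the example, not the actual template; only
$USER, $SLURM_JOB_NAME, and $SLURM_JOB_ID come from the description.

```shell
#!/bin/bash
# Sketch of the staged workflow. job_scratch_dir and run_staged are
# hypothetical helper names, and <root>/<user>/<name>_<id> is an assumed layout.

# Build a per-job scratch path from $USER, $SLURM_JOB_NAME, and $SLURM_JOB_ID.
job_scratch_dir() {
    local root="$1"
    printf '%s/%s/%s_%s\n' "$root" "$USER" "$SLURM_JOB_NAME" "$SLURM_JOB_ID"
}

# Copy inputs to scratch, run the payload there, copy results back, clean up.
run_staged() {
    local submit_dir="$1" scratch="$2"
    shift 2
    mkdir -p "$scratch" || return 1
    cp -a "$submit_dir"/. "$scratch"/ || return 1
    # If the payload fails here, the copy-back and cleanup steps below are
    # never reached and the scratch directory is left behind.
    ( cd "$scratch" && "$@" ) || return 1
    cp -a "$scratch"/. "$submit_dir"/ || return 1
    rm -rf -- "$scratch"
}

# Example use inside a batch script (the /scratch root is an assumption):
#   scratch=$(job_scratch_dir /scratch)
#   run_staged "$SLURM_SUBMIT_DIR" "$scratch" ./my_program
```

The early returns show the gap: any exit before the final rm leaves the
per-job directory in place.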

In the past, I've run a cron job that deletes any material in scratch that
hasn't been modified for more days than the maximum job length, but that can
still allow "zombie" material to remain in scratch for quite a while. I'm
intrigued by the idea of using an epilog script, triggered after each job
completes (whether normally or due to failure, requeuing, etc.), to
accomplish the same task more efficiently and consistently.
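
As a point of comparison, an age-based purge of the kind a cron job might
run can be sketched as follows. The function name, the /scratch root, the
7-day threshold, and the script path in the crontab comment are all
assumptions for the example.

```shell
#!/bin/bash
# purge_stale ROOT DAYS (hypothetical helper): delete everything under ROOT
# that is not a directory and whose modification time is older than DAYS
# days, then remove directories that are both empty and older than DAYS days.
purge_stale() {
    local root="$1" days="$2"
    [ -d "$root" ] || return 1
    # -mtime +N matches entries whose age, in whole 24-hour periods, exceeds N.
    find "$root" -mindepth 1 ! -type d -mtime +"$days" -delete
    # Deleting files updates the parent directory's mtime, so a directory
    # emptied by the line above is only removed on a later run.
    find "$root" -mindepth 1 -type d -empty -mtime +"$days" -delete
}

# Example (assumed values): purge_stale /scratch 7
# Possible crontab entry, with a hypothetical script path:
#   0 3 * * * /usr/local/sbin/purge_scratch.sh
```

The threshold has to exceed the longest allowed job, which is why leftover
data can linger for that long before it is removed.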

The first question is in which context I should run the epilog. I presume
I'd want to run it after a job completes entirely, so looking at the table,
I think I'd want an Epilog script to run on the compute node. Reading the
documentation, however, it is unclear to me whether all the variables I
would need will be available in such a script. We use the variables $USER,
$SLURM_JOB_NAME, and $SLURM_JOB_ID to create a path within scratch unique
to each job.

Specifically, however, the documentation for $SLURM_JOB_NAME says:

"SLURM_JOB_NAME Name of the job. Available in PrologSlurmctld, SrunProlog,
TaskProlog, EpilogSlurmctld, SrunEpilog and TaskEpilog."

So it doesn't seem to be available in the appropriate context. Thinking
about it, however, I presume that if I use only $SLURM_JOB_ID and $USER in
the job script (and $SLURM_JOB_USER in place of $USER in the epilog script),
the path would still be unique; that is, I could simply drop the job name.
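
A minimal sketch of such an epilog, assuming a <root>/<user>/<jobid> layout
under a hypothetical /scratch root, might look like this. The function name
and the validation rules are illustrative choices, not anything taken from
the Slurm documentation; only $SLURM_JOB_USER and $SLURM_JOB_ID are the
variables discussed above.

```shell
#!/bin/bash
# Hypothetical Epilog sketch. The scratch root and directory layout are
# assumptions; adjust them to match the job template.

# clean_job_scratch ROOT USER JOBID: remove ROOT/USER/JOBID if it exists.
# Always returns 0, because a failing Epilog causes Slurm to drain the node.
clean_job_scratch() {
    local root="$1" user="$2" jobid="$3"
    # The epilog usually runs with elevated privileges, so refuse to build
    # a path from empty or unexpected values.
    [ -n "$root" ] || return 0
    case "$jobid" in
        ''|*[!0-9]*) return 0 ;;
    esac
    case "$user" in
        ''|.*|*[!A-Za-z0-9._-]*) return 0 ;;
    esac
    local dir="$root/$user/$jobid"
    [ -d "$dir" ] && rm -rf -- "$dir"
    return 0
}

# In the actual epilog, Slurm would supply the two variables:
#   clean_job_scratch /scratch "$SLURM_JOB_USER" "$SLURM_JOB_ID"
#   exit 0
```

Because the path no longer includes the job name, the job script and the
epilog can derive the same directory from the user and job ID alone.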

Anyway, if anyone has any thoughts or examples of setting up something like
this, I'd appreciate it!

Warmest regards,

*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research Computing
Swarthmore College
Information Technology Services
(610) 328-8102
Schedule a meeting: https://calendly.com/jlsimms