[slurm-users] Get Job Array information in Epilog script
Timo Rothenpieler
timo.rothenpieler at uni-bremen.de
Fri Mar 17 12:28:06 UTC 2023
On 17/03/2023 13:11, William Brown wrote:
> We create the temporary directories using SLURM_JOB_ID, and that works
> fine with Job Arrays so far as I can see. Don't you have a problem
> if a user has multiple jobs on the same node?
>
> William
Our users just have /work/$username; anything below that, the respective
job script creates on its own.
So various different naming schemes appear in /work.
Recently some users have started submitting smaller jobs, multiple of
which run on the same node.
So their /work dir gets littered with tons of no-longer-used per-job
subdirs.
Since they've come to rely on the Epilog script cleaning up /work once
their last job on a node finishes, that was never a problem before.
But now we have run out of storage on /work multiple times: some users
have so many jobs that a node is never fully vacant of them, so their
directories never get cleaned up.
The subdirs pretty much always use one of the three styles from the script:
"${SLURM_JOB_ID}", "${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID}" or
"${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}".
I don't see how that would cause problems with multiple jobs, since all
of those are unique per job?
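To make the uniqueness argument concrete, here is a minimal sketch of the three naming styles for one hypothetical array task: array job 1234, task 5, which Slurm runs under its own job id 1238, submitted by a hypothetical user "alice" (all values invented for illustration):

```shell
# Hypothetical environment for one task of an array job.
SLURM_JOB_ID=1238
SLURM_ARRAY_JOB_ID=1234
SLURM_ARRAY_TASK_ID=5
SLURM_JOB_USER=alice

# The three directory styles the Epilog script tries to remove;
# each expands to a path unique to this task.
for d in "${SLURM_JOB_ID}" "${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID}" "${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"; do
    echo "/work/${SLURM_JOB_USER}/${d}"
done
```

Since every array task gets its own SLURM_JOB_ID, all three forms are distinct per task even when several tasks share a node.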
> On Fri, 17 Mar 2023 at 11:17, Timo Rothenpieler
> <timo.rothenpieler at uni-bremen.de> wrote:
>>
>> Hello!
>>
>> I'm currently facing a bit of an issue regarding cleanup after a job
>> completed.
>>
>> I've added the following bit of shell script to our cluster's Epilog script:
>>
>>> for d in "${SLURM_JOB_ID}" "${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID}" "${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"; do
>>> WORKDIR="/work/${SLURM_JOB_USER}/${d}"
>>> if [ -e "${WORKDIR}" ]; then
>>> rm -rf "${WORKDIR}"
>>> fi
>>> done
>>
>> However, it ended up failing to clean up the working directories of
>> array jobs.
>>
>> After some investigation, I found the reason in the documentation:
>>
>> > SLURM_ARRAY_JOB_ID/SLURM_ARRAY_TASK_ID: [...]
>> > Available in PrologSlurmctld, SrunProlog, TaskProlog,
>> EpilogSlurmctld, SrunEpilog and TaskEpilog.
>>
>> So, now I wonder... how am I supposed to get that information in the
>> Epilog script? The whole job is part of an array, so how do I get the
>> information at a job level?
>>
>> The "obvious alternative" based on that documentation would be to put
>> that bit of code into a TaskEpilog script. But my understanding of that
>> is that the script would run after each one of potentially multiple
>> srun-launched tasks in the same job, and would then clean up the
>> work-dir while the job would still use it?
>>
>> I only want to do that bit of cleanup when the job is terminating.
>>
>
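As a side note on the question above: since the array variables are not exported to the Epilog environment, one workaround sometimes used is to ask the controller for the job record and parse the ArrayJobId/ArrayTaskId fields out of `scontrol show job` output. Slurm's Prolog/Epilog guide cautions against calling Slurm commands from these scripts, so treat this strictly as a sketch; the sample line below is a hypothetical stand-in for what `scontrol show job "${SLURM_JOB_ID}"` prints for an array task:

```shell
# Hypothetical sample of scontrol output for an array task; on a real
# node this would come from:  scontrol show job "${SLURM_JOB_ID}"
sample='JobId=1238 ArrayJobId=1234 ArrayTaskId=5 JobName=demo'

# Extract the array fields; empty if the job is not part of an array.
array_job_id=$(printf '%s\n' "$sample" | sed -n 's/.*ArrayJobId=\([0-9][0-9]*\).*/\1/p')
array_task_id=$(printf '%s\n' "$sample" | sed -n 's/.*ArrayTaskId=\([0-9][0-9]*\).*/\1/p')

echo "ArrayJobId=${array_job_id} ArrayTaskId=${array_task_id}"
```

With those two values in hand, the Epilog could reconstruct the "${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}" directory name even though the variables themselves are unset there.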