<div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr">Hi<div><br></div><div>I want to run an epilogctld after all tasks of an array job have completed, in order to clean up an on-demand filesystem created in the prologctld.</div><div><br></div><div>At first I thought I could simply run the epilog after the completion of the final array task, until I realised that the tasks might not all take the same amount of time. Then I thought: Slurm must be able to handle this logic, so I shall read the docs!</div><div><br></div><div>The job array docs page says:</div><div>"A job which is to be dependent upon an entire job array should specify itself dependent upon the ArrayJobID."</div><div><br></div><div>So I thought I would poll the state of the ArrayJobID using a horrible bit of bash and sed in my pro/epilog and dump the result to a log file.</div><div><br></div><div>I then submitted a synthetic array job with 10 elements, where each element sleeps for a random amount of time (up to 10 seconds):</div><div><pre>prolog jobid=511,arraytaskid=2,SLURM_ARRAY_JOB_ID=509
prolog jobid=512,arraytaskid=3,SLURM_ARRAY_JOB_ID=509
prolog jobid=513,arraytaskid=4,SLURM_ARRAY_JOB_ID=509
prolog jobid=514,arraytaskid=5,SLURM_ARRAY_JOB_ID=509
prolog jobid=515,arraytaskid=6,SLURM_ARRAY_JOB_ID=509
prolog jobid=516,arraytaskid=7,SLURM_ARRAY_JOB_ID=509
prolog jobid=517,arraytaskid=8,SLURM_ARRAY_JOB_ID=509
prolog jobid=509,arraytaskid=10,SLURM_ARRAY_JOB_ID=509
prolog jobid=510,arraytaskid=1,SLURM_ARRAY_JOB_ID=509
prolog jobid=518,arraytaskid=9,SLURM_ARRAY_JOB_ID=509
epilog jobid=513,arraytaskid=4,SLURM_ARRAY_JOB_ID=509,JobState(509)=RUNNING
epilog jobid=514,arraytaskid=5,SLURM_ARRAY_JOB_ID=509,JobState(509)=RUNNING
epilog jobid=515,arraytaskid=6,SLURM_ARRAY_JOB_ID=509,JobState(509)=RUNNING
epilog jobid=518,arraytaskid=9,SLURM_ARRAY_JOB_ID=509,JobState(509)=RUNNING
epilog jobid=512,arraytaskid=3,SLURM_ARRAY_JOB_ID=509,JobState(509)=RUNNING
epilog jobid=517,arraytaskid=8,SLURM_ARRAY_JOB_ID=509,JobState(509)=RUNNING
epilog jobid=509,arraytaskid=10,SLURM_ARRAY_JOB_ID=509,JobState(509)=COMPLETING
epilog jobid=516,arraytaskid=7,SLURM_ARRAY_JOB_ID=509,JobState(509)=COMPLETED
epilog jobid=511,arraytaskid=2,SLURM_ARRAY_JOB_ID=509,JobState(509)=COMPLETED
epilog jobid=510,arraytaskid=1,SLURM_ARRAY_JOB_ID=509,JobState(509)=COMPLETED
</pre></div><div>Slurm seems to think that the array job is complete after the final array task has completed, even though there are 3 more tasks running. If I were to tear down the filesystem when the job goes into the COMPLETING state, as I was planning to, I would cause an error in tasks 7, 2 and 1. This would also happen if I used a job dependency like --depend=afterany:509 to try to do the same thing.</div><div><br></div><div>I have repeated it twice with timestamps in the pro and epilog scripts to show that it's not due to a race condition in the printing; there is a good few seconds between the job going into COMPLETING and the remaining tasks finishing.</div><div><pre>prolog 15:42:46.456925345 jobid=530,arraytaskid=1,SLURM_ARRAY_JOB_ID=529
prolog 15:42:46.461794268 jobid=532,arraytaskid=3,SLURM_ARRAY_JOB_ID=529
prolog 15:42:46.460449387 jobid=531,arraytaskid=2,SLURM_ARRAY_JOB_ID=529
prolog 15:42:46.464872996 jobid=533,arraytaskid=4,SLURM_ARRAY_JOB_ID=529
prolog 15:42:46.466919112 jobid=534,arraytaskid=5,SLURM_ARRAY_JOB_ID=529
prolog 15:42:46.469350534 jobid=536,arraytaskid=7,SLURM_ARRAY_JOB_ID=529
prolog 15:42:46.472210879 jobid=537,arraytaskid=8,SLURM_ARRAY_JOB_ID=529
prolog 15:42:46.473099772 jobid=538,arraytaskid=9,SLURM_ARRAY_JOB_ID=529
prolog 15:42:46.476322345 jobid=535,arraytaskid=6,SLURM_ARRAY_JOB_ID=529
prolog 15:42:46.479001241 jobid=529,arraytaskid=10,SLURM_ARRAY_JOB_ID=529
epilog 15:42:48.725625849 jobid=536,arraytaskid=7,SLURM_ARRAY_JOB_ID=529,JobState(529)=RUNNING
epilog 15:42:49.731580409 jobid=529,arraytaskid=10,SLURM_ARRAY_JOB_ID=529,JobState(529)=COMPLETING
epilog 15:42:50.597469868 jobid=532,arraytaskid=3,SLURM_ARRAY_JOB_ID=529,JobState(529)=COMPLETED
epilog 15:42:50.678213697 jobid=533,arraytaskid=4,SLURM_ARRAY_JOB_ID=529,JobState(529)=COMPLETED
epilog 15:42:50.714197275 jobid=534,arraytaskid=5,SLURM_ARRAY_JOB_ID=529,JobState(529)=COMPLETED
epilog 15:42:50.721083470 jobid=531,arraytaskid=2,SLURM_ARRAY_JOB_ID=529,JobState(529)=COMPLETED
epilog 15:42:53.707315733 jobid=530,arraytaskid=1,SLURM_ARRAY_JOB_ID=529,JobState(529)=COMPLETED
epilog 15:42:53.728610476 jobid=538,arraytaskid=9,SLURM_ARRAY_JOB_ID=529,JobState(529)=COMPLETED
epilog 15:42:55.733841379 jobid=535,arraytaskid=6,SLURM_ARRAY_JOB_ID=529,JobState(529)=COMPLETED
epilog 15:42:56.728246980 jobid=537,arraytaskid=8,SLURM_ARRAY_JOB_ID=529,JobState(529)=COMPLETED
prolog 15:43:27.518215088 jobid=541,arraytaskid=2,SLURM_ARRAY_JOB_ID=539
prolog 15:43:27.519780788 jobid=540,arraytaskid=1,SLURM_ARRAY_JOB_ID=539
prolog 15:43:27.520344497 jobid=542,arraytaskid=3,SLURM_ARRAY_JOB_ID=539
prolog 15:43:27.521069023 jobid=543,arraytaskid=4,SLURM_ARRAY_JOB_ID=539
prolog 15:43:27.523453937 jobid=544,arraytaskid=5,SLURM_ARRAY_JOB_ID=539
prolog 15:43:27.526033806 jobid=545,arraytaskid=6,SLURM_ARRAY_JOB_ID=539
prolog 15:43:27.527783240 jobid=546,arraytaskid=7,SLURM_ARRAY_JOB_ID=539
prolog 15:43:27.530369206 jobid=539,arraytaskid=10,SLURM_ARRAY_JOB_ID=539
prolog 15:43:27.534797886 jobid=547,arraytaskid=8,SLURM_ARRAY_JOB_ID=539
prolog 15:43:27.538816566 jobid=548,arraytaskid=9,SLURM_ARRAY_JOB_ID=539
epilog 15:43:29.758028002 jobid=544,arraytaskid=5,SLURM_ARRAY_JOB_ID=539,JobState(539)=RUNNING
epilog 15:43:30.831216132 jobid=540,arraytaskid=1,SLURM_ARRAY_JOB_ID=539,JobState(539)=RUNNING
epilog 15:43:30.833185667 jobid=546,arraytaskid=7,SLURM_ARRAY_JOB_ID=539,JobState(539)=RUNNING
epilog 15:43:31.799390789 jobid=541,arraytaskid=2,SLURM_ARRAY_JOB_ID=539,JobState(539)=RUNNING
epilog 15:43:31.801698014 jobid=543,arraytaskid=4,SLURM_ARRAY_JOB_ID=539,JobState(539)=RUNNING
epilog 15:43:34.837543145 jobid=545,arraytaskid=6,SLURM_ARRAY_JOB_ID=539,JobState(539)=COMPLETING
epilog 15:43:34.839544475 jobid=539,arraytaskid=10,SLURM_ARRAY_JOB_ID=539,JobState(539)=COMPLETING
epilog 15:43:35.667776053 jobid=542,arraytaskid=3,SLURM_ARRAY_JOB_ID=539,JobState(539)=COMPLETED
epilog 15:43:35.835100302 jobid=548,arraytaskid=9,SLURM_ARRAY_JOB_ID=539,JobState(539)=COMPLETED
epilog 15:43:37.827816869 jobid=547,arraytaskid=8,SLURM_ARRAY_JOB_ID=539,JobState(539)=COMPLETED
</pre></div><div>Short of making every task in the array job create a file on completion and testing for the existence of all the files, is there a way to reliably detect when all tasks in the array have completed?</div><div><br></div><div>I'm using a newly installed system based on Slurm 18.08.8 as packaged in the OpenHPC repos.</div><div><br></div><div>Thanks</div><div><br></div><div>Antony</div></div></div></div></div></div></div></div>