[slurm-users] stopping job array after N failed jobs in row

Michael DiDomenico mdidomenico4 at gmail.com
Wed Aug 2 16:01:16 UTC 2023


On Tue, Aug 1, 2023 at 3:27 PM Daniel Letai <dani at letai.org.il> wrote:
> The other OTHER approach might be to use some epilog (or possibly epilogslurmctld) to log exit codes for first 20 tasks in each array, and cancel the array if non-zero. This is a global approach which will affect all job arrays, so might not be appropriate for your use case.

you can set up a task prolog/epilog.  just test for the error condition
in the task epilog and then cancel your array if need be

https://slurm.schedmd.com/prolog_epilog.html

i've not tried it, and i'm not sure how it interacts with job arrays, but it might work
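
an untested sketch of what such an epilog helper could look like, in python.
it keeps a per-array count of consecutive failures in a small state file and
runs scancel on the whole array once the count reaches a threshold.
assumptions, none of which come from the docs page above: THRESHOLD and
STATE_DIR are made-up values, STATE_DIR has to be on a filesystem every node
can see, and the epilog you hook this into has to export SLURM_ARRAY_JOB_ID
and SLURM_JOB_EXIT_CODE. check which epilog types actually set the exit code
variable before relying on it. "in a row" here means completion order, which
is not necessarily task-index order if tasks run concurrently.

```python
#!/usr/bin/env python3
"""Sketch: cancel a job array after N consecutive failed tasks."""
import fcntl
import os
import subprocess

THRESHOLD = 3        # hypothetical N: failures in a row before cancelling
STATE_DIR = "/tmp"   # hypothetical; must be shared across nodes in practice


def update_failure_streak(state_path, exit_code):
    """Update the consecutive-failure counter in state_path and return it."""
    # Create the state file if it does not exist yet, then open read/write.
    fd = os.open(state_path, os.O_RDWR | os.O_CREAT, 0o600)
    with os.fdopen(fd, "r+") as fh:
        # Exclusive lock so concurrent epilogs on one node do not race;
        # the lock is released when the file is closed.
        fcntl.flock(fh, fcntl.LOCK_EX)
        text = fh.read().strip()
        streak = int(text) if text else 0
        # A success resets the streak, a failure extends it.
        streak = streak + 1 if exit_code != 0 else 0
        fh.seek(0)
        fh.truncate()
        fh.write(str(streak))
        fh.flush()
    return streak


def main(environ=os.environ):
    array_job_id = environ.get("SLURM_ARRAY_JOB_ID")
    if not array_job_id:
        return  # not an array task, nothing to do
    # Assumption: the exit status is exported under this name.
    exit_code = int(environ.get("SLURM_JOB_EXIT_CODE", "0"))
    state_path = os.path.join(STATE_DIR, f"array_{array_job_id}.streak")
    if update_failure_streak(state_path, exit_code) >= THRESHOLD:
        # scancel with the array job id cancels all remaining array tasks.
        subprocess.run(["scancel", array_job_id], check=False)


if __name__ == "__main__":
    main()
```

as Daniel notes above, an epilog configured in slurm.conf applies to every
job, which is why the script returns immediately when SLURM_ARRAY_JOB_ID is
not set.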


