[slurm-users] stopping job array after N failed jobs in row

Loris Bennett loris.bennett at fu-berlin.de
Wed Aug 2 05:44:27 UTC 2023


Daniel Letai <dani at letai.org.il> writes:

> Not sure about automatically cancelling a job array, except perhaps by submitting 2 consecutive arrays - the first of size 20, and the other with the rest
> of the elements and a dependency of afterok. That said, a single job in a job array is referred to as a task in the Slurm documentation. I personally
> prefer element, as in array element.
>
> Consider creating a batch job with:
>
> arrayid=$(sbatch --parsable --array=0-19 array-job.sh)
>
> sbatch --dependency=afterok:$arrayid --array=20-50000 array-job.sh
>
> I'm not near a cluster right now, so I can't test this for correctness. The main drawback is of course that if the first 20 jobs take a long time to
> complete, and there are enough resources to run more than 20 jobs in parallel, those resources will go unused for the duration. Not a big issue on busy
> clusters, as some other job will run in the meantime, but it will delay completion of the array if the first 20 jobs use significantly less than the
> resources available.

I think running an initial subarray is a good idea, since, once it has
completed, it allows the user to check whether the right amount of
resources was requested.  I often find users don't do this and end up,
say, specifying 10 or 100 times more memory than is actually needed for
an array of several thousand jobs.  This is obviously a problem even if
the jobs all complete successfully.
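
For example, once the initial subarray has finished, something along
these lines (where <arrayid> is the job ID of the subarray and the
format fields are just an illustration) shows how much memory the tasks
actually used, which can then be compared with what was requested:

  sacct -j <arrayid> --units=G --format=JobID,State,Elapsed,ReqMem,MaxRSS

If MaxRSS is far below ReqMem, the memory request for the remaining
elements can be reduced accordingly.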

Cheers,

Loris

> It might also be possible to use a dependency of afternotok on the first 20 tasks to run --wrap="scancel $arrayid".
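>
> A rough, untested sketch of that idea (the threshold of 20 and the way the individual afternotok dependencies are ANDed together with commas are just for
> illustration):
>
>  arrayid=$(sbatch --parsable --array=1-50000 array-job.sh)
>  dep="afternotok:${arrayid}_1"
>  for i in $(seq 2 20); do dep="${dep},afternotok:${arrayid}_${i}"; done
>  sbatch --dependency="$dep" --wrap="scancel $arrayid"
>
> The canceller only becomes eligible to run if all of the first 20 tasks fail; if they succeed, it is left with a dependency that can never be satisfied and
> has to be cleaned up (or is removed automatically if kill_invalid_depend is set).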
>
> Maybe something like:
>
> sbatch --array=1-50000 array-job.sh
>
> with
>
> cat array-job.sh
>
>  #!/bin/bash
>
>  # run the actual work in the background
>  srun myjob.sh $SLURM_ARRAY_TASK_ID &
>
>  # tasks beyond the first 20 additionally start a step that cancels the whole array once tasks 1-20 have all failed
>  [[ $SLURM_ARRAY_TASK_ID -gt 20 ]] && srun -d afternotok:${SLURM_ARRAY_JOB_ID}_1,afternotok:${SLURM_ARRAY_JOB_ID}_2,...,afternotok:${SLURM_ARRAY_JOB_ID}_20 scancel $SLURM_ARRAY_JOB_ID
>
>  # wait for the background step, otherwise the job ends as soon as the script exits
>  wait
>
> Will also work. Untested, use at your own risk.
>
> Yet another approach might be to use an epilog (or possibly EpilogSlurmctld) to log the exit codes of the first 20 tasks in each array and cancel the
> array if they are non-zero; a rough sketch follows below. This is a global approach which will affect all job arrays, so it might not be appropriate for
> your use case.
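>
> An untested sketch of such an EpilogSlurmctld script (the counter file location and the threshold of 20 are just placeholders, and the set of environment
> variables available to the epilog should be checked against the prolog/epilog documentation for your Slurm version):
>
>  #!/bin/bash
>  # only act on the first 20 tasks of an array job
>  [[ -n "$SLURM_ARRAY_JOB_ID" ]] || exit 0
>  [[ "$SLURM_ARRAY_TASK_ID" -le 20 ]] || exit 0
>
>  # record each failing task and cancel the whole array once all 20 have failed
>  if [[ "$SLURM_JOB_EXIT_CODE" -ne 0 ]]; then
>      countfile="/var/spool/slurmctld/failcount.${SLURM_ARRAY_JOB_ID}"
>      echo "$SLURM_ARRAY_TASK_ID" >> "$countfile"
>      [[ $(sort -u "$countfile" | wc -l) -ge 20 ]] && scancel "$SLURM_ARRAY_JOB_ID"
>  fi
>  exit 0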
>
> On 01/08/2023 16:48:47, Josef Dvoracek wrote:
>
>>  my users found the beauty of job arrays, and they tend to use them every now and then. 
>>
>>  Sometimes the human factor steps in, something is wrong in the job array specification, and the cluster "works" on one failed array job after another. 
>>
>>  Isn't there any way to automatically stop/scancel/? a job array after, let's say, 20 failed array jobs in a row? 
>>
>>  So far my experience is that if the first ~20 array jobs go right, there is no catastrophic failure in the sbatch file. If they fail, it's usually bad
>>  and there is no sense in crunching through the remaining thousands of job array jobs. 
>>
>>  OT: what is the correct terminology for one item in a job array... sub-job? job-array-job? :) 
>>
>>  cheers 
>>
>>  josef 
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin


