[slurm-users] stopping job array after N failed jobs in row

Josef Dvoracek jose at fzu.cz
Tue Aug 1 13:48:47 UTC 2023


My users have discovered the beauty of job arrays, and they use them
every now and then.

Sometimes the human factor steps in: something is wrong in the job array
specification, and the cluster "works" on one failed array job after
another.

Is there any way to automatically stop/scancel a job array after,
let's say, 20 failed array jobs in a row?

So far my experience is that if the first ~20 array jobs go right, there
is no catastrophic failure in the sbatch file. If they fail, it's usually
bad, and there is no point in crunching through the remaining thousands
of array jobs.
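
A minimal sketch of an external watchdog for the heuristic above, in
Python. This is not a built-in Slurm feature. It assumes that job
accounting is enabled so that `sacct` returns per-task states, that
"in a row" means consecutive array task indices (not completion order),
and that the states listed in FAILED_STATES are the ones to count as
failures. The function names and the default threshold of 20 are
hypothetical.

```python
import subprocess

# Assumption: these sacct states count as "failed" for the heuristic.
FAILED_STATES = {"FAILED", "NODE_FAIL", "OUT_OF_MEMORY", "TIMEOUT"}


def parse_sacct(output):
    """Return {task_index: state} from 'JobID|State' lines.

    Expects lines such as '12345_7|FAILED'. Entries without a single
    numeric task index (e.g. pending ranges like '12345_[8-99]') are
    skipped.
    """
    states = {}
    for line in output.splitlines():
        if "|" not in line:
            continue
        job_id, state = line.split("|", 1)
        task = job_id.partition("_")[2]
        words = state.split()
        if task.isdigit() and words:
            # 'CANCELLED by 1000' -> 'CANCELLED'
            states[int(task)] = words[0]
    return states


def max_failure_streak(states):
    """Longest run of failed tasks, walking tasks in index order."""
    longest = current = 0
    for task in sorted(states):
        state = states[task]
        if state in FAILED_STATES:
            current += 1
            longest = max(longest, current)
        elif state == "COMPLETED":
            current = 0
        # Other states (RUNNING, CANCELLED, ...) neither extend nor
        # reset the streak; this is a design assumption.
    return longest


def cancel_if_failing(job_id, threshold=20):
    """Cancel the whole array if `threshold` tasks failed in a row."""
    out = subprocess.run(
        ["sacct", "-j", str(job_id), "-X", "--noheader",
         "--parsable2", "--format=JobID,State"],
        capture_output=True, text=True, check=True,
    ).stdout
    if max_failure_streak(parse_sacct(out)) >= threshold:
        subprocess.run(["scancel", str(job_id)], check=True)
        return True
    return False
```

Something like this could be run periodically (for example from cron)
as `cancel_if_failing(123456)`, where 123456 is a placeholder array job
ID.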

OT: what is the correct terminology for one item in a job array...
sub-job? job-array-job? :)

cheers

josef

