[slurm-users] stopping job array after N failed jobs in row
Josef Dvoracek
jose at fzu.cz
Tue Aug 1 13:48:47 UTC 2023
My users have discovered the beauty of job arrays, and they use them every
now and then.
Sometimes the human factor steps in: something is wrong in the job array
specification, and the cluster "works" through one failed array job after another.
Is there any way to automatically stop (scancel) a job array after,
let's say, 20 failed array jobs in a row?
So far my experience is that if the first ~20 array jobs go right, there is no
catastrophic failure in the sbatch file. If they fail, it is usually bad and
there is no point in crunching through the remaining thousands of array jobs.
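To make the idea concrete, here is a rough sketch of an external watchdog, not a built-in Slurm feature. It assumes accounting is enabled so that `sacct` returns per-task states. It polls `sacct -X -n -P -o JobID,State,End`, counts how many of the most recently finished array tasks failed in a row, and runs `scancel` on the whole array once a limit is reached. The function names, the set of states treated as "failed", the limit of 20, and the 60-second polling interval are all assumptions for illustration.

```python
import subprocess
import sys
import time

# Assumption: these sacct states count as "failed"; adjust to taste.
FAILED_STATES = {"FAILED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY"}


def parse_sacct(text):
    """Parse `sacct -X -n -P -o JobID,State,End` output into (job_id, state, end) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.strip().split("|")
        if len(parts) != 3:
            continue
        job_id, state, end = parts
        # State may read e.g. "CANCELLED by 1000"; keep only the first word.
        rows.append((job_id, state.split()[0] if state else "", end))
    return rows


def trailing_failures(rows):
    """Count consecutive failed tasks, going back from the most recently finished one."""
    finished = [r for r in rows if r[1] == "COMPLETED" or r[1] in FAILED_STATES]
    finished.sort(key=lambda r: r[2])  # ISO timestamps sort chronologically as strings
    count = 0
    for _, state, _ in reversed(finished):
        if state not in FAILED_STATES:
            break
        count += 1
    return count


def watch(array_job_id, limit=20, interval=60):
    """Poll sacct; scancel the whole array once `limit` tasks in a row have failed."""
    while True:
        out = subprocess.run(
            ["sacct", "-j", array_job_id, "-X", "-n", "-P", "-o", "JobID,State,End"],
            capture_output=True, text=True, check=True,
        ).stdout
        rows = parse_sacct(out)
        if trailing_failures(rows) >= limit:
            subprocess.run(["scancel", array_job_id], check=True)
            return True
        # Nothing pending or running any more: the array is done, stop watching.
        if rows and not any(state in ("PENDING", "RUNNING") for _, state, _ in rows):
            return False
        time.sleep(interval)


if __name__ == "__main__" and len(sys.argv) > 1:
    watch(sys.argv[1])
```

Something like this could run on a login node with the array's job ID as its only argument. Ordering by end time is only an approximation of "in a row" when many tasks run concurrently.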
OT: what is the correct terminology for one item in a job array...
sub-job? job-array-job? :)
cheers
josef