I'm working on the Slurm integration in our Toil workflow runner project. I'm running into a problem where an `sbatch` command submitting a job to Slurm can fail (in my case with exit code 1 and the message "sbatch: error: Batch job submission failed: Socket timed out on send/recv operation", but possibly in other ways), yet the job may nevertheless have been accepted and may still execute.
This is a problem for Toil because, right now, when it sees a submission attempt fail, it backs off and submits the job again a little later. But Toil can't handle multiple copies of the same job running at once, and if a submission looks to the client like it failed but actually succeeded, naively resubmitting produces exactly that situation.
When an sbatch command fails, is it possible to detect the cases where the cluster will still execute the job? (For example, is the job ID guaranteed to appear on the client's standard output whenever the job is going to run on the cluster, regardless of when during the client process the socket disconnection happens, so that the job can be inquired about later?) Or do I need to tag my jobs with unique identifiers myself, so that I can poll for them in the queue after a supposedly-failed submission?
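To make that second option concrete, here is roughly the workaround I'm imagining. The submit_with_recovery helper and the tag scheme are my own invention; --job-name, --parsable, --noheader, and --name are the standard sbatch/squeue flags:

    import subprocess
    import uuid

    def submit_with_recovery(script_path):
        # Tag the job with a name we can search for afterwards.
        tag = "toil-" + uuid.uuid4().hex
        try:
            result = subprocess.run(
                ["sbatch", "--parsable", "--job-name", tag, script_path],
                capture_output=True, text=True, check=True)
            # --parsable prints "jobid" or "jobid;cluster" on success.
            return int(result.stdout.strip().split(";")[0])
        except subprocess.CalledProcessError:
            # The submission *looked* like it failed, but the job may
            # still have gotten in. Ask the controller about our tag.
            listing = subprocess.run(
                ["squeue", "--noheader", "--format=%i", "--name", tag],
                capture_output=True, text=True)
            job_ids = listing.stdout.split()
            if job_ids:
                # It got in after all; adopt it instead of resubmitting.
                return int(job_ids[0])
            raise

The hole I see is that squeue only reports jobs still in the queue; a job that already ran to completion would have to be found with something like sacct --name instead, and I don't know whether there's a window in which neither command can see it.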
Is it possible to write an idempotent sbatch command, one that can be run any number of times but will only ever submit one copy of the job?
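The best I can construct myself is check-then-submit against a deterministic job name (idempotent_sbatch and job_key below are hypothetical names of mine), but that clearly has a race between the check and the submission:

    import subprocess

    def idempotent_sbatch(script_path, job_key):
        # job_key must be a deterministic, unique-per-logical-job name,
        # so that every retry of this function refers to the same job.
        existing = subprocess.run(
            ["squeue", "--noheader", "--format=%i", "--name", job_key],
            capture_output=True, text=True).stdout.split()
        if existing:
            # Already in the queue; don't submit a second copy.
            return int(existing[0])
        result = subprocess.run(
            ["sbatch", "--parsable", "--job-name", job_key, script_path],
            capture_output=True, text=True, check=True)
        return int(result.stdout.strip().split(";")[0])

Two clients racing through this can both see an empty queue and both submit, and it also misses jobs that have already finished and left the queue. Is there any server-side deduplication, like a client-supplied token that sbatch would refuse to accept twice?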
Is the Slurm REST API any more transactional, or able to promise that a job will not actually go into the queue unless the client has acknowledged receipt of the job's assigned ID?
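For reference, my understanding is that REST submission would look something like the sketch below (the API version, the payload fields, and the rest_submit helper are my best reading of the slurmrestd docs, not tested):

    import requests

    SLURMRESTD = "http://localhost:6820"  # wherever slurmrestd listens

    def rest_submit(script_text, user, token):
        response = requests.post(
            SLURMRESTD + "/slurm/v0.0.39/job/submit",
            headers={"X-SLURM-USER-NAME": user,
                     "X-SLURM-USER-TOKEN": token},
            json={"script": script_text,
                  "job": {"name": "toil-job",
                          "current_working_directory": "/tmp",
                          "environment": ["PATH=/bin:/usr/bin"]}},
            timeout=30)
        response.raise_for_status()
        # On success the response body includes the assigned job ID.
        return response.json()["job_id"]

If the server assigns the job ID and enqueues the job atomically once it has read the request, then a timeout waiting for this response is just as ambiguous as the sbatch case, unless there's some handshake I'm missing.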
Thanks,
-Adam
--
Adam Novak (He/Him)
Senior Software Engineer
Computational Genomics Lab
UC Santa Cruz Genomics Institute
"Revealing life’s code."
Personal Feedback:
https://forms.gle/UXZhZc123knF65Dw5