I'm working on the Slurm integration in our Toil workflow runner project. I'm having a problem where an `sbatch` command to submit a job to Slurm can fail (with exit code 1 and message "sbatch: error: Batch job submission failed: Socket timed out on send/recv operation", in my case, but possibly in other ways), but the job can still actually have been submitted, and can still execute.
This causes a problem for Toil because right now, when it sees a submission attempt fail, it backs off and submits the job again a little later. But Toil can't handle multiple copies of the same job running at once, and if a submission appears to the client to have failed but actually succeeded, it's possible to get into that situation if you just submit again.
When an sbatch command fails, is it possible to detect the cases where the cluster will still execute the job anyway? (For example, is the job ID guaranteed to appear on the client's standard output whenever the job is going to execute on the cluster, no matter where in the client process the socket disconnection happens, so that the job can be inquired about later?) Do I maybe need to tag my jobs with unique identifiers myself, so I can poll for them in the queue after a supposedly-failed submission?
Is it possible to write an idempotent sbatch command, where it can be run any number of times but will only actually submit one copy of the job?
Is the Slurm REST API somehow more transactional, or able to promise somehow that a job will not actually go into the queue without the client having acknowledged receipt of the job's assigned ID?
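The unique-identifier workaround I'm imagining would look roughly like this Python sketch (the "toil-" name scheme and the squeue recovery query are just my guesses at what might work, not anything Slurm documents as a guarantee):

```python
import re
import subprocess
import uuid

def parse_sbatch_output(stdout):
    """Extract the job ID from sbatch's 'Submitted batch job NNN' line."""
    match = re.search(r"Submitted batch job (\d+)", stdout)
    return int(match.group(1)) if match else None

def submit_with_recovery(script_path):
    """Submit a batch script, tagging it with a unique job name so that an
    apparently-failed submission can be checked for afterwards.

    Returns the job ID, or None if the job really was not submitted and a
    retry is safe.
    """
    # Tag the job with a name we can search for later.
    job_name = "toil-" + uuid.uuid4().hex
    result = subprocess.run(
        ["sbatch", "--job-name", job_name, script_path],
        capture_output=True, text=True,
    )
    job_id = parse_sbatch_output(result.stdout)
    if result.returncode == 0 and job_id is not None:
        return job_id
    # sbatch reported failure, but the job may still have been accepted.
    # Ask the controller whether a job with our unique name exists.
    query = subprocess.run(
        ["squeue", "--noheader", "--format=%i", "--name", job_name],
        capture_output=True, text=True,
    )
    ids = query.stdout.split()
    if ids:
        # The "failed" submission actually went through.
        return int(ids[0])
    return None
```

Of course this still has a window where the job was accepted but hasn't yet appeared in squeue's view, which is part of why I'm asking what guarantees exist.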
Thanks, -Adam
On 2026/02/18 01:56, Adam Novak via slurm-users wrote:
... Toil can't handle multiple copies of the same job running at once ... Is it possible to write an idempotent sbatch command, where it can be run any number of times but will only actually submit one copy of the job?
Could you not make use of the
--dependency=singleton
constraint, to achieve something close to what your meta-scheduler needs?
From the sbatch manpage:
singleton This job can begin execution after any previously launched jobs sharing the same job name and user have terminated. In other words, only one job by that name and owned by that user can be running or suspended at any point in time. In a federation, a singleton dependency must be fulfilled on all clusters unless DependencyParameters=disable_remote_singleton is used in slurm.conf.
You would still need to catch any queued dupe(s) that your meta-scheduler created but there wouldn't be two running at once.
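As a rough sketch of what the submission side might look like (untested, and the job-name scheme is entirely up to your meta-scheduler):

```python
def singleton_sbatch_args(script_path, job_name):
    """Build an sbatch command line using --dependency=singleton, so that
    even if the same job is accidentally submitted twice, at most one copy
    (per job name and user) runs at a time.
    """
    return [
        "sbatch",
        "--job-name", job_name,     # singleton matches on name + user
        "--dependency=singleton",   # at most one such job runs at once
        script_path,
    ]
```

Any queued duplicates would simply wait for the running copy to terminate, rather than running alongside it.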
That could probably help; I'd still want to make the job names unique to prevent multiple workflows under one user from delaying each other, but I'd be able to have something much closer to correct without a lot of second-guessing the submission return code.
On Tue, Feb 17, 2026 at 9:12 PM Kevin Buckley via slurm-users <slurm-users@lists.schedmd.com> wrote:
... Could you not make use of the --dependency=singleton constraint, to achieve something close to what your meta-scheduler needs? ...
Another option, probably better, would be to use WCKeys. See for example how https://github.com/WFU-HPC/OOD-MultitenantApps solved a very similar problem exploiting WCKeys (and other things)
On Wed, Feb 18, 2026 at 9:08 AM Adam Novak via slurm-users <slurm-users@lists.schedmd.com> wrote:
... I'd be able to have something much closer to correct without a lot of second-guessing the submission return code. ...
Davide, how do you envision WCKeys being used here? I can imagine assigning a globally unique WCKey to every job, to allow retrieving or identifying a job later, but it doesn't seem like the WCKeys system is intended to be used with thousands of distinct WCKey values. It looks like the multitenant setup uses just one WCKey value of "multitenant".
Thanks, -Adam
On Wed, Feb 18, 2026 at 11:27 PM Davide DelVento <davide.quantum@gmail.com> wrote:
... Another option, probably better, would be to use WCKeys. ...
Hi Adam,
No, obviously that would be too many, and perhaps of no benefit. I was envisioning using a relatively small number of WCKeys in combination with something in the job names. That way you would only need to query and parse a limited number of jobs to see whether the one(s) you want to resubmit are already running (or have completed), as opposed to querying the whole database.
Something similar to what the multitenant app does, as described on pages 38-60 of https://github.com/WFU-HPC/OOD-MultitenantApps/blob/main/presentation.pdf. You might also take inspiration from how they cram information into the job name!
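As a very rough sketch (again, I have not done this myself) of how a query combining a WCKey with a job name might look, using sacct so that completed jobs are visible too:

```python
import subprocess

def parse_sacct_lines(text):
    """Turn sacct --parsable2 output ('JobID|JobName|State' per line)
    into a list of dicts."""
    jobs = []
    for line in text.splitlines():
        fields = line.split("|")
        if len(fields) == 3:
            jobs.append({"id": fields[0], "name": fields[1], "state": fields[2]})
    return jobs

def find_jobs(wckey, job_name):
    """Look up jobs by WCKey and job name in the accounting records, so a
    supposedly-failed submission can be checked even after the job has
    left the active queue. Exact output handling will vary by site.
    """
    result = subprocess.run(
        ["sacct", "--noheader", "--parsable2",
         "--format=JobID,JobName,State",
         "--wckeys", wckey, "--name", job_name],
        capture_output=True, text=True,
    )
    return parse_sacct_lines(result.stdout)
```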
Disclaimer: I have not done this myself, but I've seen their presentation and spoken with them, and it seemed very interesting.
HTH, Davide
On Thu, Feb 19, 2026 at 11:35 AM Adam Novak <anovak@soe.ucsc.edu> wrote:
... it doesn't seem like the WCKeys system is intended to be used with thousands of distinct WCKey values. ...
On 2/17/26 12:56 pm, Adam Novak via slurm-users wrote:
... an `sbatch` command to submit a job to Slurm can fail ... but the job can still actually have been submitted, and can still execute.
I know others have given ideas on working around this, but have you had a chance to dig into why this is happening for you? That sort of network timeout points to the slurmctld being totally overwhelmed with RPCs, wedged in I/O, or some odd network problem.
Do you see anything diagnostic in the slurmctld logs when that's happening?
All the best, Chris
I'm not really in a position to check, since I'm not our cluster admin. I asked him and he thought it might be down to high load on the client node at that point in time; we often run submission commands from our shared compute nodes, which can become overloaded because they aren't themselves managed by a scheduler. If it's *not* that and it's something we really need to investigate, that would be good to know.
On Mon, Feb 23, 2026 at 9:42 PM Christopher Samuel via slurm-users <slurm-users@lists.schedmd.com> wrote:
... Do you see anything diagnostic in the slurmctld logs when that's happening? ...