I'm not really in a position to check, since I'm not our cluster admin. I asked him, and he thought it might be down to high load on the client node at the time: we often run submission commands from our shared compute nodes, which can become overloaded because they aren't themselves managed by a scheduler. If it's *not* that and it's something we really need to investigate, that would be good to know.
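If it really is client-side load, our failures should line up with the load average on the submit host at the moment we call sbatch, so I'm thinking of logging something like this around our submissions (a rough sketch, not our actual Toil code; submit_with_load_logging is just an illustrative name):

import os
import subprocess

def submit_with_load_logging(sbatch_args):
    """Run sbatch and record the client's load average, to test the
    overloaded-submit-host theory."""
    load1, _load5, _load15 = os.getloadavg()  # submit host's load averages
    result = subprocess.run(["sbatch"] + sbatch_args,
                            capture_output=True, text=True)
    if result.returncode != 0:
        # Correlate the failure with how busy this node was at the time
        print(f"sbatch failed at load {load1:.1f}: {result.stderr.strip()}")
    return result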

On Mon, Feb 23, 2026 at 9:42 PM Christopher Samuel via slurm-users <slurm-users@lists.schedmd.com> wrote:
On 2/17/26 12:56 pm, Adam Novak via slurm-users wrote:

> I'm working on the Slurm integration in our Toil workflow runner
> project. I'm having a problem where an `sbatch` command to submit a job
> to Slurm can fail (with exit code 1 and message "sbatch: error: Batch
> job submission failed: Socket timed out on send/recv operation", in my
> case, but possibly in other ways), but the job may actually have been
> submitted anyway, and can still execute.
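For anyone else bitten by this: the belt-and-braces workaround that keeps
coming up is to tag every submission with a unique job name and, on a
timeout, ask squeue whether the job was accepted anyway. A rough sketch of
the idea (illustrative names, untested, not Toil's actual code):

import subprocess
import uuid

def submit_and_verify(script_path):
    """Submit with a unique name; if sbatch fails, check whether the
    job slipped through anyway before treating it as a real failure."""
    name = "verify-" + uuid.uuid4().hex  # unique tag for this submission
    result = subprocess.run(
        ["sbatch", "--parsable", "--job-name=" + name, script_path],
        capture_output=True, text=True)
    if result.returncode == 0:
        return result.stdout.strip().split(";")[0]  # the job ID
    # sbatch reported failure, but the RPC may have landed; look for the job
    check = subprocess.run(
        ["squeue", "--noheader", "--format=%i", "--name=" + name],
        capture_output=True, text=True)
    ids = check.stdout.split()
    if ids:
        return ids[0]  # it was submitted after all
    raise RuntimeError("submission really failed: " + result.stderr.strip())

There's still a race if the job starts and finishes before the squeue
check, so a fully robust version would consult sacct as well.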

I know others have given ideas on working around this, but have you had
a chance to dig into why this is happening for you? That sort of network
timeout points to the slurmctld being totally overwhelmed with RPCs,
wedged in I/O, or some odd network problem.
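If it's RPC pressure, sdiag on the head node should show it: a ballooning
agent queue, or every server thread busy. Even grabbing a couple of numbers
from its output whenever a timeout happens would tell you a lot; a quick
sketch (it just scrapes sdiag's plain-text output, with the field names as
printed by recent Slurm releases):

import subprocess

def controller_pressure():
    """Pull a couple of slurmctld health numbers out of sdiag's output."""
    out = subprocess.run(["sdiag"], capture_output=True, text=True).stdout
    stats = {}
    for line in out.splitlines():
        line = line.strip()
        # e.g. "Server thread count: 3" or "Agent queue size: 0"
        if line.startswith(("Server thread count", "Agent queue size")):
            key, _, value = line.partition(":")
            stats[key.strip()] = int(value.strip())
    return stats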

Do you see anything diagnostic in the slurmctld logs when that's happening?

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Philadelphia, PA, USA



--
Adam Novak (He/Him)
Senior Software Engineer
Computational Genomics Lab
UC Santa Cruz Genomics Institute
"Revealing life’s code."
