[slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

Fri Jun 14 08:40:26 UTC 2019

Hi Chris

You are right to point out that the job actually runs despite the error from sbatch. The customer mentions that:
=== start ===
The problem followed the usual scenario: the job script was submitted and executed, but the sbatch command returned a non-zero exit status to ecflow, which therefore assumed the job was dead.
=== end ===

Which version of Slurm are you using? I'm using "17.02.4-1", and we are wondering whether to upgrade to a newer version; I am hoping this was a bug that SchedMD has since fixed.

Best Regards

mg.

-----Original Message-----
From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Christopher Harrop - NOAA Affiliate
Sent: Thursday, 13 June 2019 16:47
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

Hi,

My group is struggling with this also.  

The worst part of this, which no one has brought up yet, is that the sbatch command does not necessarily fail to submit the job in this situation.  In fact, most of the time (for us), it succeeds.  There appears to be some sort of race condition or something else going on.  The job is often (maybe most of the time?) submitted just fine, but sbatch returns a non-zero status (meaning the submission failed) and reports the error message.  

From a workflow management perspective this is an absolute disaster that leads to workflow corruption and messes that are difficult to clean up.  Workflow management systems rely on the exit status of sbatch to tell the truth about whether a job submission succeeded or not.  If submission fails the workflow manager will resubmit the job, and if it succeeds it expects a jobid to be returned.  Because sbatch usually lies about the failure of job submission when these events happen, workflow management systems think the submission failed and then resubmit the job.  This causes two copies of the same job to run at the same time, each trampling over the other and causing a cascade of other failures that become difficult to deal with.

The problem is that the job submission request has already been received by the time sbatch dies with that error.  So, the timeout happens after the job request has already been made.  I don’t know how one would solve this problem.  In my experience interfacing various batch schedulers to workflow management systems, I’ve learned that attempting to time out qsub/sbatch/bsub/etc commands always leads to a race condition.  You can’t time it out (barring ridiculously long timeouts to catch truly pathological scenarios) because the request has already been sent and received; it’s the response that never makes it back to you.  Because of the race condition, there is probably no way to use a timeout and still guarantee that failure really means failure and success really means success.  The best option that I know of is to never (this means a finite, but long, time) time out a job submission command; just wait for the response.  That’s the only way to get the correct response.

One workaround I’m using is to inject a long random string into the --comment option.  Then, if I see the socket timeout, I use squeue to look for that job and retrieve its ID.  It’s not ideal, but it can work.
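
A minimal sketch of that workaround in Python, not the actual implementation used here. The names make_tag, find_job_by_comment and submit are hypothetical. It assumes that "sbatch --parsable" prints the job ID (optionally followed by ";cluster") and that "squeue -h -o '%i %k'" prints one "jobid comment" line per job; both are worth checking against your Slurm version. It also treats any non-zero sbatch status as a possible false failure, and it cannot find a job that has already finished and left the queue.

```python
import subprocess
import uuid


def make_tag():
    """Return a random string unlikely to match any other job's comment."""
    return "wf-" + uuid.uuid4().hex


def find_job_by_comment(tag, squeue_output):
    """Scan 'jobid comment' lines; return the job ID whose comment equals tag."""
    for line in squeue_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].strip() == tag:
            return parts[0]
    return None


def submit(script, user=None, run=subprocess.run):
    """Submit script; return its job ID, or None if no matching job is found.

    'run' is injectable so the logic can be exercised without a cluster.
    """
    tag = make_tag()
    result = run(["sbatch", "--parsable", "--comment=" + tag, script],
                 capture_output=True, text=True)
    if result.returncode == 0:
        # --parsable output is assumed to be "jobid" or "jobid;cluster".
        return result.stdout.strip().split(";")[0]

    # sbatch reported failure, but the job may have been accepted anyway:
    # look for our unique comment before deciding to resubmit.
    cmd = ["squeue", "-h", "-o", "%i %k"]
    if user is not None:
        cmd += ["-u", user]
    listing = run(cmd, capture_output=True, text=True)
    return find_job_by_comment(tag, listing.stdout)
```

A workflow manager would resubmit only when submit() returns None. Since squeue can hit the same socket timeout, a real version would probably retry the lookup a few times before giving up.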

Chris
