[slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"
Christopher Harrop - NOAA Affiliate
christopher.w.harrop at noaa.gov
Thu Jun 13 14:47:07 UTC 2019
My group is struggling with this also.
The worst part of this, which no one has brought up yet, is that the sbatch command does not necessarily fail to submit the job in this situation. In fact, most of the time (for us), it succeeds. There appears to be some sort of race condition or something else going on. The job is often (maybe most of the time?) submitted just fine, but sbatch returns a non-zero exit status (indicating that the submission failed) and reports the error message.
From a workflow management perspective this is an absolute disaster that leads to workflow corruption and messes that are difficult to clean up. Workflow management systems rely on sbatch's exit status to tell the truth about whether a job submission succeeded. If submission fails the workflow manager will resubmit the job, and if it succeeds it expects a jobid to be returned. Because sbatch falsely reports failure when these events happen, workflow management systems think the submission failed and resubmit the job. This causes two copies of the same job to run at the same time, each trampling over the other and causing a cascade of other failures that become difficult to deal with.
The problem is that the job submission request has already been received by the time sbatch dies with that error. So, the timeout happens after the job request has already been made. I don’t know how one would solve this problem. In my experience interfacing various batch schedulers to workflow management systems, I’ve learned that attempting to time out qsub/sbatch/bsub/etc commands always leads to a race condition. You can’t time it out (barring ridiculously long timeouts to catch truly pathological scenarios) because the request has already been sent and received; it’s the response that never makes it back to you. Because of the race condition, there is probably no way to use a timeout and still guarantee that failure really means failure and success really means success. The best option that I know of is to never (meaning a finite, but very long, time) time out a job submission command; just wait for the response. That’s the only way to get the correct answer.
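As a rough illustration of the "just wait" approach, here is a minimal Python sketch of a submission wrapper (the helper names are mine, not part of Slurm) that deliberately sets no client-side timeout and reads the job ID from `sbatch --parsable` output:

```python
import subprocess

def parse_sbatch_output(stdout):
    # `sbatch --parsable` prints "jobid" or "jobid;clustername".
    return stdout.strip().split(";")[0]

def submit(script_path):
    # Deliberately no timeout= argument: once the request has reached
    # slurmctld, killing sbatch early just loses the reply and leaves
    # us unable to tell whether the job was actually accepted.
    result = subprocess.run(
        ["sbatch", "--parsable", script_path],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"sbatch failed: {result.stderr.strip()}")
    return parse_sbatch_output(result.stdout)
```

If a timeout is unavoidable at some outer layer, it belongs far above the submission call, long enough to cover only truly pathological hangs.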
One way I’m working around this is to inject a long random string into the --comment option. Then, if I see the socket timeout, I use squeue to look for that comment string and retrieve the job's ID. It’s not ideal, but it can work.
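A sketch of that workaround, with hypothetical helper names of my own; it assumes squeue's `%i` (job ID) and `%k` (comment) format specifiers:

```python
import secrets
import subprocess

def find_job_by_comment(squeue_output, tag):
    # Each squeue output line looks like "<jobid> <comment>".
    for line in squeue_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].strip() == tag:
            return parts[0]
    return None

def submit_with_tag(script_path):
    tag = secrets.token_hex(32)  # long random string, unlikely to collide
    result = subprocess.run(
        ["sbatch", "--parsable", f"--comment={tag}", script_path],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        return result.stdout.strip().split(";")[0]
    # Submission *may* still have succeeded; check for the tagged job
    # before concluding that it failed.
    query = subprocess.run(
        ["squeue", "--noheader", "--format=%i %k"],
        capture_output=True, text=True,
    )
    job_id = find_job_by_comment(query.stdout, tag)
    if job_id is not None:
        return job_id
    raise RuntimeError(f"sbatch failed: {result.stderr.strip()}")
```

Note that this still has a small window: a job that was accepted but has already finished and left the queue won't be found by squeue, so a belated sacct check may also be needed.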