[slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

Jeffrey Frey frey at udel.edu
Thu Jun 13 15:06:22 UTC 2019

The error message cited is associated with SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT, which is only ever raised by slurm_send_timeout() and slurm_recv_timeout().  Those functions raise that error when a generic socket-based send/receive operation exceeds a time limit imposed by the caller.  The functions use gettimeofday() to grab an initial timestamp; on each iteration of the poll() loop they call gettimeofday() again, compute the delta between the current and initial timestamps, and subtract it from the remaining timeout.

Do you have any reason to suspect that your local times are fluctuating on the cluster?  That use of gettimeofday() to calculate actual time deltas is not recommended for that very reason:

       The time returned by gettimeofday() is affected by discontinuous jumps in the system time (e.g., if the system
       administrator manually changes the system time).  If you need a monotonically increasing clock, see clock_gettime(2).

> On Jun 13, 2019, at 10:47 AM, Christopher Harrop - NOAA Affiliate <christopher.w.harrop at noaa.gov> wrote:
> Hi,
> My group is struggling with this also.  
> The worst part of this, which no one has brought up yet, is that the sbatch command does not necessarily fail to submit the job in this situation.  In fact, most of the time (for us), it succeeds.  There appears to be some sort of race condition going on.  The job is often (maybe most of the time?) submitted just fine, but sbatch returns a non-zero status (indicating that the submission failed) and reports the error message.  
> From a workflow management perspective this is an absolute disaster that leads to workflow corruption and messes that are difficult to clean up.  Workflow management systems rely on the exit status of sbatch to tell the truth about whether a job submission succeeded or not.  If submission fails the workflow manager will resubmit the job, and if it succeeds it expects a jobid to be returned.  Because sbatch falsely reports failure when these events happen, workflow management systems think the submission failed and resubmit the job.  This causes two copies of the same job to run at the same time, each trampling over the other and causing a cascade of other failures that become difficult to deal with.
> The problem is that the job submission request has already been received by the time sbatch dies with that error.  So, the timeout happens after the job request has already been made.  I don’t know how one would solve this problem.  In my experience in interfacing various batch schedulers to workflow management systems I’ve learned that attempting to time out qsub/sbatch/bsub/etc commands always leads to a race condition. You can’t time it out (barring ridiculously long timeouts to catch truly pathological scenarios) because the request has already been sent and received; it’s the response that never makes it back to you.  Because of the race condition there is probably no way to guarantee that failure really means failure and success really means success and use a timeout that guarantees failure.  The best option that I know of is to never (this means a finite, but long, time) time out a job submission command; just wait for the response.  That’s the only way to get the correct response.
> One way I’m using to work around this is to inject a long random string into the --comment option.  Then, if I see the socket timeout, I use squeue to look for that job and retrieve its ID.  It’s not ideal, but it can work.
> Chris

Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE  19716
Office: (302) 831-6034  Mobile: (302) 419-4976

