[slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

Christopher Harrop - NOAA Affiliate christopher.w.harrop at noaa.gov
Fri Jun 14 13:29:26 UTC 2019

> Hi Chris
> You are right in pointing out that the job actually runs, despite the error from sbatch. The customer mentioned that:
> === start ===
> The problem followed the usual scenario: the job script was submitted and executed, but the sbatch command returned a non-zero exit status to ecflow, which therefore assumed the job was dead.
> === end ===
> Which version of Slurm are you using? I'm using 17.02.4-1, and we are considering upgrading to a newer version; that is, I hope this was a bug that SchedMD has since fixed.

Sorry, I missed that. I am not the admin of the system, but I believe we are using 18.08.7. I believe we have a ticket open with SchedMD and our admin team is working with them. The approach being taken, as I understand it, is to capture statistics with sdiag and use that information to tune configuration parameters. My understanding is that they view the problem as a configuration issue rather than a bug in the scheduler, which means to me that the timeouts can only be minimized, not eliminated. And because workflow corruption is such a disastrous event, I have built in attempts to work around it even though occurrences are "rare".
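Since a timed-out sbatch may still have been accepted by the controller, one possible workaround (a sketch only, not the author's actual implementation; the function name, job name, retry count, and backoff values are all illustrative) is to verify via squeue whether the job exists before resubmitting, so a spurious timeout does not create duplicate jobs:

```shell
#!/bin/bash
# Sketch: submit a job, but on sbatch failure check whether the
# controller accepted it anyway before retrying. Assumes the job
# name passed in is unique enough to identify the submission.

submit_with_retry() {
    local script="$1" name="$2" tries=3 out i
    for ((i = 1; i <= tries; i++)); do
        if out=$(sbatch --job-name="$name" "$script" 2>&1); then
            echo "$out"          # e.g. "Submitted batch job <id>"
            return 0
        fi
        # sbatch returned non-zero, but the RPC may have gone through;
        # treat a matching queued/running job as a successful submit.
        if squeue --noheader --name="$name" | grep -q .; then
            echo "job '$name' is queued despite sbatch error" >&2
            return 0
        fi
        sleep $((i * 5))         # back off before the next attempt
    done
    return 1
}
```

The name check is the important part: a plain retry loop would double-submit exactly in the case this thread describes, where the error is spurious and the job is already running.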
