[slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

Christopher Harrop - NOAA Affiliate christopher.w.harrop at noaa.gov
Fri Jun 14 13:29:26 UTC 2019

> Hi Chris
> You are right in pointing out that the job actually runs, despite the error from sbatch. The customer mentioned that:
> === start ===
> The problem followed the usual scenario: the job script was submitted and executed, but the sbatch command returned a non-zero exit status to ecflow, which therefore assumed the job was dead.
> === end ===
> Which version of Slurm are you using? I'm using 17.02.4-1, and we are considering upgrading to a newer version; that is, I hope this was a bug that SchedMD has since fixed.

Sorry, I missed that. I am not the admin of the system, but I believe we are using 18.08.7. I believe we have a ticket open with SchedMD and our admin team is working with them. The approach being taken, as I understand it, is to capture statistics with sdiag and use that information to tune configuration parameters. My understanding is that they view the problem as a configuration issue rather than a bug in the scheduler, which means to me that the timeouts can only be minimized, not eliminated. And because workflow corruption is such a disastrous event, I have built in attempts to work around it even though occurrences are "rare".
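Since a timed-out sbatch may still have been accepted by the controller, one possible workaround (a sketch only, not the author's actual implementation; the function name, job name, retry count, and backoff values are all illustrative) is to verify via squeue whether the job exists before resubmitting, so a spurious timeout does not create duplicate jobs:

```shell
#!/bin/bash
# Sketch: submit a job, but on sbatch failure check whether the
# controller accepted it anyway before retrying. Assumes the job
# name passed in is unique enough to identify the submission.

submit_with_retry() {
    local script="$1" name="$2" tries=3 out i
    for ((i = 1; i <= tries; i++)); do
        if out=$(sbatch --job-name="$name" "$script" 2>&1); then
            echo "$out"          # e.g. "Submitted batch job <id>"
            return 0
        fi
        # sbatch returned non-zero, but the RPC may have gone through;
        # treat a matching queued/running job as a successful submit.
        if squeue --noheader --name="$name" | grep -q .; then
            echo "job '$name' is queued despite sbatch error" >&2
            return 0
        fi
        sleep $((i * 5))         # back off before the next attempt
    done
    return 1
}
```

The name check is the important part: a plain retry loop would double-submit exactly in the case this thread describes, where the error is spurious and the job is already running.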
