[slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"
b.h.mevik at usit.uio.no
Wed Jun 12 08:53:51 UTC 2019
Another possible cause (we currently see it on one of our clusters):
delays in LDAP lookups.
We run sssd on the machines, and occasionally, when sssd contacts the
LDAP server, it takes 5 or 10 seconds (or even 15) before it gets an
answer. If that happens while slurmctld is looking up some user or
group, etc., client commands that depend on slurmctld will hang. The
default message timeout is 10 seconds, so if the delay is longer than
that, you get the timeout error.
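
One way to check whether name lookups on a node are slow enough to
trip that timeout is to time them directly through the same NSS path
(and hence sssd) that other programs use. This is a minimal sketch,
not something taken from Slurm itself: the helper names timed_lookup
and exceeds_timeout are made up for illustration, "root" is only a
placeholder account, and the 10-second constant is the default
mentioned above.

```python
import pwd
import time

# Default Slurm MessageTimeout in seconds, as stated in the text above.
MESSAGE_TIMEOUT_S = 10


def timed_lookup(lookup, key):
    """Run lookup(key) and return (elapsed_seconds, result_or_None)."""
    start = time.monotonic()
    try:
        result = lookup(key)
    except KeyError:
        # Unknown user or group; the time spent is still informative.
        result = None
    return time.monotonic() - start, result


def exceeds_timeout(elapsed, timeout=MESSAGE_TIMEOUT_S):
    """Return True if a lookup this slow would outlast the timeout."""
    return elapsed > timeout


if __name__ == "__main__":
    # Placeholder account; replace with a user that lives in LDAP.
    elapsed, entry = timed_lookup(pwd.getpwnam, "root")
    print(f"lookup took {elapsed:.3f}s, slow={exceeds_timeout(elapsed)}")
```

Note that sssd caches answers, so repeating the lookup for the same
user will usually be fast; the slow cases are more likely to show up
for entries that are not in the cache. grp.getgrnam can be passed in
the same way to time group lookups.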
We don't know why the delays are happening, but while we are debugging
it, we've increased the MessageTimeout, which seems to have reduced the
problem a bit. We're also experimenting with GroupUpdateForce and
GroupUpdateTime to reduce the number of times slurmctld needs to ask
about groups, but I'm unsure how much that helps.
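
For reference, all three parameters are set in slurm.conf. The excerpt
below is only a sketch of what such a change could look like: the
values are illustrative assumptions, not taken from any real cluster,
and the comments reflect my reading of the slurm.conf man page, so
check them against the documentation for your Slurm version.

```
# slurm.conf (excerpt); all values are illustrative assumptions

# Seconds to wait for a round-trip message before giving up
# (the default is 10).
MessageTimeout=30

# 0: refresh group membership information only after /etc/group
# changes; non-zero: refresh periodically regardless.
GroupUpdateForce=0

# Seconds between group membership refreshes (0 disables the
# periodic refresh).
GroupUpdateTime=3600
```

A change like this normally needs "scontrol reconfigure" or a restart
of the daemons before it takes effect.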
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo