[slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

Fri Jun 14 07:03:55 UTC 2019

Christopher Benjamin Coffey <Chris.Coffey at nau.edu> writes:

> Hi, you may want to look into increasing the sssd cache length on the
> nodes,

We have thought about that, but it will not solve the problem, only make
it less frequent, I think.

> and improving the network connectivity to your ldap
> directory.

That is something we are investigating, yes.

> I recall when playing with sssd in the past that it wasn't
> actually caching. Verify with tcpdump, and "ls -l" through a
> directory. Once the uid/gid is resolved, it shouldn't be hitting the
> directory anymore till the cache expires.

We turned up the logging of the AD backend, and the logs indicate that
the caching works in our case: First time you look up a user/group in a
while, the backend gets the request, but subsequent lookups never reach
the backend (at least not according to the logs), which should mean that
sssd has cached the info.
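
A rough client-side cross-check of the backend logs is to time repeated lookups of the same user: if the first lookup is slow and the following ones are fast, the answer is presumably coming from a cache. The following is only a minimal sketch in Python, assuming the node resolves users through NSS (and thereby sssd). `time_lookups` is a hypothetical helper name used for illustration, not part of sssd or Slurm.

```python
import os
import pwd
import time


def time_lookups(lookup, key, repeats=5):
    """Return the wall-clock duration (seconds) of each of `repeats` calls to lookup(key)."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        try:
            lookup(key)
        except KeyError:
            # Unknown user/uid: the lookup still went through NSS, so time it anyway.
            pass
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Assumption: the current uid is resolved via NSS/sssd on this node.
    for i, seconds in enumerate(time_lookups(pwd.getpwuid, os.getuid()), 1):
        print(f"lookup {i}: {seconds * 1000:.3f} ms")
```

The timings only hint at caching. They cannot tell which layer answered, so the backend logs or tcpdump remain the more reliable check.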

> Do the nodes NAT through the head node?

We do, but we see the sssd delays on the head node as well, and on other
nodes outside the cluster that use the same ldap/ad servers.  But we
_do_ have a quite complicated network setup due to security, so there
might be something there.  I'm currently trying to get my hands on the
logs from the servers themselves to see whether they actually get the
requests at the time when the sssd backend claims to make them.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo