[slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

John Hearns hearnsj at googlemail.com
Thu Jun 13 16:02:45 UTC 2019


I agree with Christopher Coffey - look at the sssd caching.
I have had experience with sssd and can help a bit.
Also, if you are seeing long waits, could you have nested groups?
sssd is notorious for not handling these well, and there are settings in
the configuration file which you can experiment with.

On Thu, 13 Jun 2019 at 16:52, Mark Hahn <hahn at mcmaster.ca> wrote:

> On Thu, 13 Jun 2019, Christopher Harrop - NOAA Affiliate wrote:
> ...
> > One way I'm using to work around this is to inject a long random string
> > into the --comment option.  Then, if I see the socket timeout, I use squeue
> > to look for that job and retrieve its ID.  It's not ideal, but it can
> > work.
>
> I would have expected a different approach: use a unique string for the
> jobname, and always verify after submission.  after all, squeue provides
> a --name parameter for this (efficient query by logical job "identity").
>
> regards, mark hahn.
>
>
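
For what it's worth, Mark's suggestion could be sketched roughly as below.
This is only an illustration, not anything tested on a real cluster: the
function name submit_and_verify and the "job-" prefix are made up, and it
assumes sbatch and squeue are on the PATH. The flags used (sbatch --parsable
and --job-name, squeue --name, --noheader and --format=%i) are standard ones.

```python
import subprocess
import uuid


def submit_and_verify(script, run=subprocess.run):
    """Submit `script` under a unique job name and return (name, job_id).

    job_id is None if the job cannot be found after a failed submission.
    `run` is injectable so the logic can be exercised without Slurm.
    """
    # Hypothetical naming scheme: any string unique per submission will do.
    name = "job-" + uuid.uuid4().hex

    result = run(
        ["sbatch", "--parsable", f"--job-name={name}", script],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0 and result.stdout.strip():
        # --parsable prints "jobid" or "jobid;cluster".
        return name, result.stdout.strip().split(";")[0]

    # sbatch reported an error (e.g. the socket timeout), but the job may
    # still have been accepted, so look it up by its unique name.
    query = run(
        ["squeue", "--noheader", "--format=%i", f"--name={name}"],
        capture_output=True,
        text=True,
    )
    ids = query.stdout.split()
    return name, (ids[0] if ids else None)
```

If the controller is busy enough for sbatch to time out, the squeue call may
time out as well, so in practice you would probably want to retry the lookup
after a short pause before resubmitting.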