[slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

Mark Hahn hahn at mcmaster.ca
Thu Jun 13 15:49:50 UTC 2019


On Thu, 13 Jun 2019, Christopher Harrop - NOAA Affiliate wrote:
...
> One way I'm using to work around this is to inject a long random string
>into the --comment option.  Then, if I see the socket timeout, I use squeue
>to look for that job and retrieve its ID.  It's not ideal, but it can work.

I would have expected a different approach: use a unique string for the
job name, and always verify after submission.  After all, squeue provides
a --name parameter for this (an efficient query by logical job "identity").
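
A rough sketch of that idea in Python, in case it helps.  Assumptions on my
part, not anything from the original report: the helper names (run,
find_job_ids, submit_verified) and the "job-" prefix are made up for
illustration, and --states=all is added so that a job which has already
finished still shows up while slurmctld remembers it.  The return value of
sbatch is deliberately ignored; the squeue lookup is what decides whether
the job exists.

```python
import subprocess
import uuid


def run(cmd):
    """Run a command and return (exit_code, stdout)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout


def find_job_ids(name, runner=run):
    """Ask squeue for the IDs of all jobs that carry this job name."""
    code, out = runner(
        ["squeue", "--noheader", "--states=all", "--format=%i", f"--name={name}"]
    )
    if code != 0:
        # squeue itself can time out; the caller should retry later.
        raise RuntimeError("squeue failed; cannot verify submission")
    return [line.strip() for line in out.splitlines() if line.strip()]


def submit_verified(script, runner=run):
    """Submit under a unique job name, then always verify with squeue."""
    name = f"job-{uuid.uuid4().hex}"  # hypothetical naming scheme
    # Ignore sbatch's result on purpose: a socket timeout does not tell us
    # whether the controller accepted the job or not.
    runner(["sbatch", "--parsable", f"--job-name={name}", script])
    return name, find_job_ids(name, runner)


if __name__ == "__main__":
    job_name, job_ids = submit_verified("job.sh")
    print(job_name, job_ids)
```

An empty ID list means the submission most likely never registered, so it
should be safe to resubmit.  More than one ID would point to an earlier
retry that went through after all.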

regards, mark hahn.


