[slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"
hahn at mcmaster.ca
Thu Jun 13 15:49:50 UTC 2019
On Thu, 13 Jun 2019, Christopher Harrop - NOAA Affiliate wrote:
> One way I'm using to work around this is to inject a long random string
>into the --comment option. Then, if I see the socket timeout, I use squeue
>to look for that job and retrieve its ID. It's not ideal, but it can work.
I would have expected a different approach: use a unique string for the
job name, and always verify after submission. After all, squeue provides
a --name parameter for this (efficient query by logical job "identity").
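
A minimal sketch of that submit-then-verify pattern, in Python. The helper
names (make_job_name, find_jobs_by_name, submit_and_verify) and the "job"
prefix are made up for illustration; only sbatch --job-name and squeue
--name/--noheader/--format are actual Slurm options. It assumes the job is
still pending or running when squeue is called, since squeue does not list
jobs that have already left the queue, and it does not retry if squeue
itself hits the same timeout.

```python
import subprocess
import uuid


def make_job_name(prefix="job"):
    """Return a job name that is unique per submission attempt."""
    return f"{prefix}-{uuid.uuid4().hex}"


def parse_job_ids(squeue_output):
    """Turn the output of `squeue --noheader --format=%i` into a list of IDs."""
    return [line.strip() for line in squeue_output.splitlines() if line.strip()]


def find_jobs_by_name(name, run=subprocess.run):
    """Ask squeue for jobs carrying exactly this name and return their IDs."""
    result = run(
        ["squeue", "--noheader", "--format=%i", f"--name={name}"],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if squeue itself fails
    )
    return parse_job_ids(result.stdout)


def submit_and_verify(script, prefix="job", run=subprocess.run):
    """Submit `script` under a unique name, then look the job up by that name.

    Returns (name, job_ids). An empty list means the submission did not
    register (or the job already finished); more than one ID would mean
    the name was not unique after all.
    """
    name = make_job_name(prefix)
    # The sbatch exit status is deliberately ignored: a socket timeout does
    # not tell us whether the controller accepted the job, so squeue decides.
    run(["sbatch", f"--job-name={name}", script], capture_output=True, text=True)
    return name, find_jobs_by_name(name, run=run)
```

Usage would be along the lines of name, ids = submit_and_verify("job.sh"),
resubmitting only when ids comes back empty. The run argument exists only so
the commands can be replaced with a stub when testing without a cluster.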
regards, mark hahn.