[slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"
Christopher W. Harrop
christopher.w.harrop at noaa.gov
Thu Jun 13 16:03:47 UTC 2019
>> One way I'm using to work around this is to inject a long random string
>> into the --comment option. Then, if I see the socket timeout, I use squeue
>> to look for that job and retrieve its ID. It's not ideal, but it can work.
> I would have expected a different approach: use a unique string for the
> jobname, and always verify after submission. after all, squeue provides
> a --name parameter for this (efficient query by logical job "identity").
The job name is already in use, and it is not unique because there may be many copies of a workflow running at the same time by the same user. There is essentially no difference between verifying a match with jobname and a match with the comment; it’s just a different field of the output you’re looking at, which you can control with format options.
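For anyone who wants to try the comment-based approach, here is a rough sketch of how it could look in Python. It is not taken from our workflow code: the function names (make_token, find_job_ids, submit) are made up for illustration, and matching on the "Socket timed out" text in stderr is an assumption based on the error in the subject line. It relies on sbatch's --comment and --parsable options and on squeue's %i (job ID) and %k (comment) format fields. It does not handle the case where squeue itself times out, or where the job has not yet appeared in the queue when squeue runs.

```python
import getpass
import subprocess
import uuid


def make_token():
    """Return a long random string used to tag a single submission."""
    return uuid.uuid4().hex


def find_job_ids(squeue_output, token):
    """Return the job IDs whose comment field equals token.

    Expects lines of the form "<jobid>|<comment>", as printed by
    squeue -h -o "%i|%k".
    """
    ids = []
    for line in squeue_output.splitlines():
        job_id, sep, comment = line.strip().partition("|")
        if sep and comment == token:
            ids.append(job_id)
    return ids


def submit(script, token, timeout_marker="Socket timed out"):
    """Submit script tagged with token; on a socket timeout, ask squeue.

    Returns the job ID as a string, or None if the timeout occurred and
    no job carrying the token was found.
    """
    result = subprocess.run(
        ["sbatch", "--parsable", f"--comment={token}", script],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        # --parsable prints "jobid" or "jobid;cluster".
        return result.stdout.strip().split(";")[0]
    if timeout_marker not in result.stderr:
        raise RuntimeError(result.stderr.strip())
    # The request may or may not have reached the controller, so look
    # for a job of ours that carries the token in its comment field.
    listing = subprocess.run(
        ["squeue", "-h", "-u", getpass.getuser(), "-o", "%i|%k"],
        capture_output=True, text=True, check=True,
    )
    matches = find_job_ids(listing.stdout, token)
    return matches[0] if matches else None
```

Swapping "%k" for "%j" in the format string and comparing against a job name would give the jobname variant; the matching logic is otherwise the same, which is the point made above.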