[slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

Christopher W. Harrop christopher.w.harrop at noaa.gov
Thu Jun 13 16:03:47 UTC 2019


> ...
>> One way I'm using to work around this is to inject a long random string
>> into the --comment option.  Then, if I see the socket timeout, I use squeue
>> to look for that job and retrieve its ID.  It's not ideal, but it can work.
> 
> I would have expected a different approach: use a unique string for the
> jobname, and always verify after submission.  after all, squeue provides
> a --name parameter for this (efficient query by logical job "identity").

The job name is already in use, and it is not unique, because the same user may be running many copies of a workflow at the same time. There is essentially no difference between verifying a match on the job name and a match on the comment; it's just a different field of the squeue output, which you can select with the format options.
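
For anyone who wants to try the comment-based approach, here is a rough sketch of how it could look in Python. It is not the code from my workflow tool: the function names (make_token, find_job_ids, submit_with_token) are made up for illustration, the "|" separator is an arbitrary choice, and it assumes that any non-zero exit from sbatch is worth a follow-up squeue check. It relies on sbatch's --comment and --parsable options and on squeue's %i (job ID) and %k (comment) format fields.

```python
import getpass
import subprocess
import uuid


def make_token():
    """Return a long random string to use as the job's comment."""
    return uuid.uuid4().hex


def find_job_ids(squeue_output, token):
    """Return the job IDs whose comment field equals the token.

    Expects one job per line in the form "<jobid>|<comment>", as produced
    by: squeue --noheader --format=%i|%k
    """
    ids = []
    for line in squeue_output.splitlines():
        job_id, sep, comment = line.partition("|")
        if sep and comment.strip() == token:
            ids.append(job_id.strip())
    return ids


def submit_with_token(script):
    """Submit a batch script; if sbatch fails, look the job up by its comment.

    Hypothetical helper: it treats every sbatch failure (not only the socket
    timeout) as a reason to check whether the job was accepted anyway.
    """
    token = make_token()
    try:
        result = subprocess.run(
            ["sbatch", "--parsable", "--comment=" + token, script],
            capture_output=True, text=True, check=True,
        )
        # --parsable prints "jobid" or "jobid;cluster"
        return result.stdout.strip().split(";")[0]
    except subprocess.CalledProcessError:
        listing = subprocess.run(
            ["squeue", "--noheader", "--user", getpass.getuser(),
             "--format=%i|%k"],
            capture_output=True, text=True, check=True,
        ).stdout
        matches = find_job_ids(listing, token)
        if len(matches) != 1:
            raise  # job not found (or ambiguous): report the original failure
        return matches[0]
```

Matching on the job name instead would only mean swapping %k for %j in the format string, which is why the two approaches are equivalent from the verification point of view. Note that the squeue query can itself time out when the controller is busy, so a real implementation would probably want to retry it.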
