[slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

Marcelo Garcia Marcelo.Garcia at EMEA.NEC.COM
Wed Jun 12 07:42:48 UTC 2019


Hi Steffen

We are using Lustre as the underlying file system:
[root at teta2 ~]# cat /proc/fs/lustre/version
lustre: 2.7.19.11

Nothing has changed. I think this has been happening for a long time, but before it was very sporadic; only recently did it become more frequent.

Best Regards

mg.


-----Original Message-----
From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Steffen Grunewald
Sent: Dienstag, 11. Juni 2019 16:28
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

On Tue, 2019-06-11 at 13:56:34 +0000, Marcelo Garcia wrote:
> Hi 
> 
> Since mid-March 2019 we have been having a strange problem with slurm. Sometimes, the command "sbatch" fails:
> 
> + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1
> sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

I've seen such an error message from the underlying file system.
Is there anything special (e.g. non-NFS) in your setup that may have changed
in the past few months?

Just a shot in the dark, of course...
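
If you want to rule Lustre out, a quick sanity check (assuming the lfs
client tools are installed on the submit hosts, and "host" standing in
for your actual submit node) is something like:

[root at host ~]# lfs check servers

which should report whether the client's connections to the MDS/OSS
targets are active. Just a sketch, of course - your mount layout may differ.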

> Ecflow runs preprocessing on the script, which generates a second script that is submitted to slurm. In our case, the submission script is called "42.job1". 
> 
> The problem we have is that sometimes, the "sbatch" command fails with the message above. We couldn't find any hint in the logs. Hardware and software logs are clean. I increased the debug level of slurm to:
> # scontrol show config
> (...)
> SlurmctldDebug          = info
> 
> But still no clue about what is happening. Maybe the next thing to try is to use "sdiag" to inspect the server. Another complication is that the problem is random, so should we put "sdiag" in a cronjob? Is there a better way to run "sdiag" periodically?
> 
> Thanks for your attention.
> 
> Best Regards
> 
> mg.
> 
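
Regarding sdiag: a plain cron entry should be enough to sample it
periodically. A minimal sketch (the log path, binary path and the
5-minute interval are just examples - adjust to your setup):

*/5 * * * * /usr/bin/date >> /var/log/slurm/sdiag.log; /usr/bin/sdiag >> /var/log/slurm/sdiag.log 2>&1

If each sample should only cover the last interval, sdiag also has a
--reset option to clear its counters after reading.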

- S

-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~


