[slurm-users] Socket timed out on send/recv operation

Chris Samuel chris at csamuel.org
Sat Oct 20 21:52:23 MDT 2018


On Friday, 19 October 2018 4:58:37 AM AEDT Kirk Main wrote:

> I'm a new administrator to Slurm and I've just got my new cluster up and
> running. We started getting a lot of "Socket timed out on send/recv
> operation" errors when submitting jobs, and also if you try to "squeue"
> while others are submitting jobs. The job does eventually run after about a
> minute, but the entire system feels very sluggish and obviously this isn't
> normal. Not sure what's going on here...

Hmm, you're trying to do HA for Slurm with NFS.  I suspect that's going to be 
killing you unless your NFS server is very very fast.

From conversations I've had with folks in the past, if you want to do HA you 
need shared storage that can sustain a lot of IOPS for it to really be usable.

Try it without HA first *AND* use local disk for your state directory, to see 
if the problem goes away.  If it does, then you know you're going to need to 
find a different way to do that storage in future if you really want to do 
HA.

If it doesn't go away then you'll know there's something more fundamental 
going on, but from what you describe it really does sound like NFS latencies 
are the problem here.
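
If you want a rough number to compare the NFS mount against local disk before
changing anything, something like the sketch below might help.  It just times
small write+fsync cycles in whatever directories you give it.  The function
names and the 4 KiB / 200 sample defaults are mine, and small synchronous
writes are only my assumption of a reasonable stand-in for state saves, not a
description of what slurmctld actually does on disk.

```python
import os
import statistics
import sys
import tempfile
import time


def fsync_write_latencies(directory, samples=200, size=4096):
    """Time `samples` small write+fsync cycles on a temp file in `directory`.

    Returns a list of per-write latencies in seconds. The sample count and
    payload size are illustrative defaults, not values taken from Slurm.
    """
    payload = b"\0" * size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory, prefix="latency-probe-")
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.lseek(fd, 0, os.SEEK_SET)   # overwrite the same block each time
            os.write(fd, payload)
            os.fsync(fd)                   # force the write out to the storage
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)                    # leave nothing behind in the directory
    return latencies


def summarize(latencies):
    """Reduce a list of latencies (seconds) to median, p99 and max in ms."""
    ordered = sorted(latencies)
    p99_index = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return {
        "median_ms": statistics.median(ordered) * 1000,
        "p99_ms": ordered[p99_index] * 1000,
        "max_ms": ordered[-1] * 1000,
    }


if __name__ == "__main__":
    # Pass the directories to compare as arguments, e.g. the NFS-backed
    # state directory and a scratch directory on local disk.
    for directory in sys.argv[1:]:
        print(directory, summarize(fsync_write_latencies(directory)))
```

Run it once against the NFS-backed directory and once against a local one,
ideally while jobs are being submitted.  If the NFS median or p99 is far
above the local figures, that would be consistent with the storage being the
bottleneck, though it isn't proof on its own.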

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC





