[slurm-users] Socket timed out on send/recv operation
Chris Samuel
chris at csamuel.org
Sat Oct 20 21:52:23 MDT 2018
On Friday, 19 October 2018 4:58:37 AM AEDT Kirk Main wrote:
> I'm a new administrator to Slurm and I've just got my new cluster up and
> running. We started getting a lot of "Socket timed out on send/recv
> operation" errors when submitting jobs, and also if you try to "squeue"
> while others are submitting jobs. The job does eventually run after about a
> minute, but the entire system feels very sluggish and obviously this isn't
> normal. Not sure what's going on here...
Hmm, you're trying to do HA for Slurm with NFS. I suspect that's going to be
killing you unless your NFS server is very very fast.
From conversations I've had with folks in the past, if you want to do HA you
need shared storage that can sustain a lot of IOPS for it to really be usable.
Try it without HA first *AND* use local disk for your state directory, to see
if the problem goes away. If it does then you know you're going to need to
find a different way to do that storage in future if you really want to
do HA.
If it doesn't go away then you'll know there's something more fundamental
going on, but from what you describe it really does sound like NFS latencies
are the problem here.
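Before changing the configuration, a rough way to compare the two locations
is to time small synced writes in each directory. The sketch below is only a
proxy and does not reproduce slurmctld's actual I/O pattern; the function
names, write size and write count are illustrative assumptions, and the
directories you pass on the command line are whatever your NFS-backed and
local candidate paths happen to be.

```python
import os
import statistics
import sys
import tempfile
import time


def fsync_write_latencies(directory, count=200, size=4096):
    """Time `count` small write+fsync cycles on a scratch file in `directory`.

    Returns a list of per-write latencies in seconds. The defaults are
    arbitrary illustrative values, not anything Slurm itself uses.
    """
    payload = b"\0" * size
    latencies = []
    # Scratch file is created inside the directory under test and removed afterwards.
    fd, path = tempfile.mkstemp(dir=directory, prefix="latency-probe-")
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.lseek(fd, 0, os.SEEK_SET)   # overwrite the same small region
            os.write(fd, payload)
            os.fsync(fd)                   # force the write out to the backing storage
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


def summarize(latencies):
    """Return median, approximate 99th percentile and max latency in milliseconds."""
    ordered = sorted(latencies)
    p99_index = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return {
        "median_ms": statistics.median(ordered) * 1000,
        "p99_ms": ordered[p99_index] * 1000,
        "max_ms": ordered[-1] * 1000,
    }


if __name__ == "__main__":
    # Pass one or more directories to compare, e.g. the NFS path and a local path.
    for directory in sys.argv[1:]:
        print(directory, summarize(fsync_write_latencies(directory)))
```

If the NFS-backed directory shows much higher median or tail latencies than
local disk, that is consistent with the storage being the bottleneck, though
the real test is still running Slurm itself with a local state directory.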
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC