[slurm-users] Slurm / OpenHPC socket timeout errors

Kenneth Roberts kroberts at materialsdesign.com
Mon Nov 26 09:34:59 MST 2018


Here is the debug log from a node (c2) when the job fails:

 

c2: [2018-11-26T07:35:56.261] debug3: in the service_connection

c2: [2018-11-26T07:36:16.281] debug:  slurm_recv_timeout at 0 of 9680, timeout

c2: [2018-11-26T07:36:16.282] error: slurm_receive_msg_and_forward: Socket timed out on send/recv operation

c2: [2018-11-26T07:36:16.292] error: service_connection: slurm_receive_msg: Socket timed out on send/recv operation

c2: [2018-11-26T07:36:16.334] debug3: in the service_connection

 

The line "debug:  slurm_recv_timeout at 0 of 9680, timeout" suggests that the
receive timed out before reading even the first byte of the message.

 

Here is the code snippet that generates that debug message:

 

extern int slurm_recv_timeout(int fd, char *buffer, size_t size,
                              uint32_t flags, int timeout)
.
.
.
while (recvlen < size) {
           timeleft = timeout - _tot_wait(&tstart);
           if (timeleft <= 0) {
                debug("%s at %d of %zu, timeout", __func__, recvlen, size);
                slurm_seterrno(SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT);
                recvlen = SLURM_ERROR;
                goto done;
           }

 

recvlen is 0 according to the log message, which might indicate that it
failed on the first pass through the loop (timeleft <= 0).
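
To make the two possible readings of "at 0 of 9680" concrete, here is a toy model of the loop in Python with a simulated clock. It is only a sketch: recv_with_deadline and the arrivals list are hypothetical names, and it assumes the part of the loop not shown above waits for data for at most timeleft milliseconds.

```python
SLURM_ERROR = -1  # same sentinel the C code assigns to recvlen on timeout


def recv_with_deadline(arrivals, size, timeout_ms):
    """Toy model of the receive loop, driven by a simulated clock.

    arrivals: list of (time_ms, nbytes) pairs, sorted by time, standing in
    for data showing up on the socket (hypothetical, not a Slurm structure).
    Returns (status, bytes_read, elapsed_ms).
    """
    now = 0
    recvlen = 0
    pending = list(arrivals)
    while recvlen < size:
        timeleft = timeout_ms - now
        if timeleft <= 0:
            # Mirrors the debug() call: reports progress at the moment of timeout.
            print(f"recv at {recvlen} of {size}, timeout after {now} ms")
            return SLURM_ERROR, recvlen, now
        if pending and pending[0][0] - now < timeleft:
            # Data arrives before the budget runs out: consume it.
            arrival, nbytes = pending.pop(0)
            now = max(now, arrival)
            recvlen += min(nbytes, size - recvlen)
        else:
            # Nothing arrives in time: the wait uses up the remaining budget.
            now += timeleft
    return recvlen, recvlen, now


print(recv_with_deadline([], 9680, 0))      # timeout passed in as 0
print(recv_with_deadline([], 9680, 20000))  # 20 s budget, peer sends nothing
```

In this model both calls end with zero bytes read, so the "at 0 of 9680" text alone cannot tell a zero timeout apart from a peer that stays silent for the whole budget. Only the elapsed time differs.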

 

MessageTimeout=20 in our slurm.conf

But this code acts as if it was passed timeout = 0?
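
One quick check on the timeout = 0 idea is to measure the gap between the log timestamps. The sketch below uses two lines copied from the c2 log above and assumes (the log does not prove this) that the debug3 line at 07:35:56.261 marks the start of the same connection that later times out. The helper name stamp is made up for this example.

```python
from datetime import datetime

# Two lines copied from the c2 debug log earlier in this message.
LOG = """\
c2: [2018-11-26T07:35:56.261] debug3: in the service_connection
c2: [2018-11-26T07:36:16.281] debug:  slurm_recv_timeout at 0 of 9680, timeout
"""


def stamp(line):
    """Parse the [YYYY-MM-DDTHH:MM:SS.mmm] timestamp from a log line."""
    text = line[line.index("[") + 1 : line.index("]")]
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f")


first, second = (stamp(line) for line in LOG.splitlines())
elapsed = (second - first).total_seconds()
print(f"{elapsed:.3f} s between connection and timeout")
```

Under that assumption the gap is about 20 seconds, which lines up with MessageTimeout=20. That would point toward no data arriving for the full timeout rather than a timeout of 0.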

 

Up the call stack, slurm_receive_msg_and_forward sets the timeout to the
default:

 

if (timeout <= 0)
           /* convert secs to msec */
           timeout = slurm_get_msg_timeout() * 1000;

 

Unless slurm_get_msg_timeout() is not working?

 

It may be that the slurm.conf values aren't getting set correctly or used
correctly, though I don't see anything like permission errors reading
slurm.conf ...
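
A rough way to check what value a given copy of slurm.conf actually carries is sketched below. This is not how Slurm parses its configuration: message_timeout is a hypothetical helper, it ignores Include directives, and the fallback of 10 seconds is the documented MessageTimeout default as far as I know.

```python
def message_timeout(conf_text, default=10):
    """Return MessageTimeout in seconds from slurm.conf text, or the default."""
    value = default
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0]  # drop comments
        for token in line.split():
            key, sep, val = token.partition("=")
            # slurm.conf keywords are case-insensitive
            if sep and key.lower() == "messagetimeout":
                value = int(val)
    return value


sample = "ClusterName=test\nMessageTimeout=20\n"
timeout_ms = message_timeout(sample) * 1000  # same secs-to-msec conversion
print(timeout_ms)
```

Running something like this against the slurm.conf on the master and on each compute node would show whether they agree. On a running system, the output of scontrol show config also lists MessageTimeout.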

 

Continuing the search ...

 

From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of
Kenneth Roberts
Sent: Friday, November 23, 2018 4:15 PM
To: slurm-users at lists.schedmd.com
Subject: [slurm-users] Slurm / OpenHPC socket timeout errors

 

Hi -

 

I have the following on a new cluster with OpenHPC & Slurm built off the
latest recipe and packages from OpenHPC (built this week).

 

One master node and 4 compute nodes.

NodeName=c[1-4] Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 State=UNKNOWN

 

With simple test scripts, sbatch produces the following error when running
across more than one node -

 

The batch script is -

 

#!/bin/bash

srun hostname

 

$ sbatch -N4 -n4 hostname.sh

 

Out file --

c1

srun: error: Task launch for 151.0 failed on node c4: Socket timed out on send/recv operation

srun: error: Task launch for 151.0 failed on node c3: Socket timed out on send/recv operation

srun: error: Task launch for 151.0 failed on node c2: Socket timed out on send/recv operation

srun: error: Application launch failed: Socket timed out on send/recv operation

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

srun: error: Timed out waiting for job step to complete

 

Searching on this error turns up a lot of information about large jobs that
start many tasks very quickly, with recommendations for timeout and
large-cluster settings. BUT I'm running four tasks that are just 'hostname'!

 

AND if I just run srun directly from the command line, it works across the nodes

 

$ srun -N4 -n4 hostname

c1

c2

c3

c4

 

Also, if I sbatch up to 20 tasks, which fit on one node, they launch fine.
But with 21 tasks (which needs two nodes), the 20 tasks on c1 run (20 lines
of output) and the 21st task fails on c2 -

c1

c1

c1

... (17 more)

srun: error: Task launch for 156.0 failed on node c2: Socket timed out on send/recv operation

srun: error: Application launch failed: Socket timed out on send/recv operation

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

srun: error: Timed out waiting for job step to complete
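
The 20/21 boundary matches the node definition above (Sockets=2 CoresPerSocket=10 ThreadsPerCore=1). A small sketch of the arithmetic, assuming one CPU per task and that a node is filled before the next one is used (nodes_needed is a made-up helper, not a Slurm call):

```python
import math

# Values from the NodeName=c[1-4] line in slurm.conf.
SOCKETS, CORES_PER_SOCKET, THREADS_PER_CORE = 2, 10, 1
cpus_per_node = SOCKETS * CORES_PER_SOCKET * THREADS_PER_CORE


def nodes_needed(ntasks, cpus_per_task=1):
    """Smallest number of nodes that can hold ntasks under the assumptions above."""
    return math.ceil(ntasks * cpus_per_task / cpus_per_node)


for ntasks in (20, 21):
    print(ntasks, "tasks ->", nodes_needed(ntasks), "node(s)")
```

So the failure appears exactly when the job step has to reach a second node, which suggests the problem is tied to launching on a remote node from inside the batch script rather than to the number of tasks.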

 

 

Maybe I completely misunderstand the sbatch options/params (I'm using
defaults). BUT I'm attempting the simplest thing I could think of just to
test this out.

 

As another test, a script that uses a job array to run 32 copies of a simple
python script (so there's no srun in the batch script) appears to work
properly and uses all the nodes. But submitting a script that contains srun
with sbatch gives the errors.

 

Really hoping this is something obvious that as a noob to OpenHPC and Slurm
I'm getting wrong.

 

Thanks in advance for any pointers or answers!

 

Ken
