[slurm-users] Slurm / OpenHPC socket timeout errors

Marcus Wagner wagner at itc.rwth-aachen.de
Tue Nov 27 04:39:00 MST 2018


Hi Ken,

if the switches between the hosts are configured correctly, this SHOULD 
pose no problem, but we have seen problems with Path MTU Discovery 
ourselves. If the switches are configured wrongly, communication between 
the two hosts with different MTUs becomes very one-sided :-)
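
A quick way to check whether the full MTU really makes it across the path is a ping with the don't-fragment bit set; the host names and sizes below are only examples:

    # 1500-byte MTU minus 28 bytes of IP/ICMP header = 1472-byte payload
    ping -c 3 -M do -s 1472 c2
    # tracepath reports the discovered path MTU hop by hop
    tracepath c2

If a small default ping works but the large one with -M do does not, something on the path (switch port, gateway or interface) is configured with a smaller MTU.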

It gets even more problematic if you have UDP communication and routers 
between several subnets. Our old cluster used IPoIB in connected mode 
with an MTU of 65535 (I think). We had several compute nodes and one 
backup master in the Ethernet world, connected via IB-Ethernet gateways.
Whenever our brand-new masters were online, it could take hours, up to a 
day, until all hosts were discovered as online in LSF, whereas it took 
minutes when we used our old master from the Ethernet world. UDP does 
not do anything like path MTU discovery; it just sends packets as big as 
it is allowed to, which leads to mangled packets on the remote host.

Nonetheless, as far as I know, Slurm communicates over TCP only, so 
perhaps there was a misconfigured switch between those hosts?
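
If a switch in between is the suspect, the error and drop counters on the host-facing interfaces at both ends are worth a look (the interface name is just an example):

    # per-interface RX/TX error and drop counters
    ip -s link show dev eno1
    # driver-level counters, if ethtool is available
    ethtool -S eno1 | grep -Ei 'drop|err'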


Best
Marcus

On 11/26/2018 09:49 PM, Kenneth Roberts wrote:
>
> D’oh!
>
> The compute nodes had a different MTU on their network interfaces than 
> the master.  Once they were all set to 1500, it works!
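>
> For reference, a rough sketch of how the MTUs can be audited and aligned from the master (the node list and interface name are just examples for our setup):
>
>     # report the MTU of every interface on each compute node
>     for n in c1 c2 c3 c4; do
>         ssh "$n" ip -o link show | awk -v n="$n" '{print n, $2, $4, $5}'
>     done
>     # reset one node's interface to 1500 (eno1 is an example name);
>     # make it persistent via the distro's network configuration
>     ssh c2 sudo ip link set dev eno1 mtu 1500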
>
> So ... any ideas why that was a problem? Maybe the interfaces had no 
> fragmentation set and there were dropped packets?
>
> Thanks for listening.
>
> Ken
>
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Kenneth Roberts
> Sent: Monday, November 26, 2018 9:38 AM
> To: slurm-users at lists.schedmd.com
> Subject: Re: [slurm-users] Slurm / OpenHPC socket timeout errors
>
> I wasn’t looking close enough at the times in the log file.
>
> c2: [2018-11-26T10:09:40.963] debug3: in the service_connection
> c2: [2018-11-26T10:10:00.983] debug:  slurm_recv_timeout at 0 of 9589, timeout
> c2: [2018-11-26T10:10:00.983] error: slurm_receive_msg_and_forward: Socket timed out on send/recv operation
> c2: [2018-11-26T10:10:00.994] error: service_connection: slurm_receive_msg: Socket timed out on send/recv operation
> c2: [2018-11-26T10:10:01.106] debug3: in the service_connection
>
> It looks like slurm_recv_timeout keeps trying for 20 seconds and the 
> call just keeps hitting the continue branch without reading any data –
>
>     if ((rc = poll(&ufds, 1, timeleft)) <= 0) {
>         if ((errno == EINTR) || (errno == EAGAIN) || (rc == 0))
>             continue;
>         else {
>             debug("%s at %d of %zu, poll error: %m",
>                   __func__, recvlen, size);
>             slurm_seterrno(SLURM_COMMUNICATIONS_RECEIVE_ERROR);
>             recvlen = SLURM_ERROR;
>             goto done;
>         }
>     }
>
> So poll is timing out after 20 seconds.
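>
> One thing worth ruling out while digging is plain TCP reachability on the Slurm ports between the hosts (6817 for slurmctld and 6818 for slurmd are the defaults; adjust if slurm.conf overrides them, and "master" and the node names are placeholders for our hosts):
>
>     # from the master: is slurmd on a compute node reachable?
>     timeout 3 bash -c ': </dev/tcp/c2/6818' && echo "c2 6818 reachable"
>     # from a compute node: is slurmctld on the master reachable?
>     timeout 3 bash -c ': </dev/tcp/master/6817' && echo "master 6817 reachable"
>     # a handshake can succeed while larger messages stall, so also push
>     # a few kB through an established connection and see whether it hangs
>     dd if=/dev/zero bs=1k count=10 2>/dev/null | ssh c2 'cat > /dev/null'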
>
> Back to finding out why ...
>
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Kenneth Roberts
> Sent: Monday, November 26, 2018 8:35 AM
> To: slurm-users at lists.schedmd.com
> Subject: Re: [slurm-users] Slurm / OpenHPC socket timeout errors
>
> Here is the debug log on a node (c2) when the job fails ....
>
> c2: [2018-11-26T07:35:56.261] debug3: in the service_connection
> c2: [2018-11-26T07:36:16.281] debug:  slurm_recv_timeout at 0 of 9680, timeout
> c2: [2018-11-26T07:36:16.282] error: slurm_receive_msg_and_forward: Socket timed out on send/recv operation
> c2: [2018-11-26T07:36:16.292] error: service_connection: slurm_receive_msg: Socket timed out on send/recv operation
> c2: [2018-11-26T07:36:16.334] debug3: in the service_connection
>
> The line "debug:  slurm_recv_timeout at 0 of 9680, timeout" looks like 
> it times out before reading even the first byte of the message.
>
> Here is the code snippet that generates that debug message:
>
>     extern int slurm_recv_timeout(int fd, char *buffer, size_t size,
>                                   uint32_t flags, int timeout)
>     ...
>     while (recvlen < size) {
>         timeleft = timeout - _tot_wait(&tstart);
>         if (timeleft <= 0) {
>             debug("%s at %d of %zu, timeout", __func__, recvlen, size);
>             slurm_seterrno(SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT);
>             recvlen = SLURM_ERROR;
>             goto done;
>         }
>
> recvlen is 0 according to the log message, which might indicate it hit 
> the timeout on the first pass through the loop (timeleft <= 0).
>
> MessageTimeout=20 in our slurm.conf
>
> But this code acts like it was passed timeout = 0??
>
> Up the call stack, slurm_receive_msg_and_forward sets the timeout to 
> the default:
>
>     if (timeout <= 0)
>         /* convert secs to msec */
>         timeout = slurm_get_msg_timeout() * 1000;
>
> Unless slurm_get_msg_timeout() is not working?
>
> It may be that the slurm.conf values aren’t getting set correctly or 
> used correctly, though I don’t see anything like permission errors 
> reading slurm.conf ...
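>
> A quick sanity check here is to ask the running daemons what they actually parsed, and to confirm every node sees the same slurm.conf (the path and node names are just our setup):
>
>     scontrol show config | grep -i MessageTimeout
>     for n in c1 c2 c3 c4; do ssh "$n" md5sum /etc/slurm/slurm.conf; done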
>
> Continuing the search ...
>
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Kenneth Roberts
> Sent: Friday, November 23, 2018 4:15 PM
> To: slurm-users at lists.schedmd.com
> Subject: [slurm-users] Slurm / OpenHPC socket timeout errors
>
> Hi –
>
> I am seeing the following on a new cluster with OpenHPC & Slurm, built 
> this week from the latest OpenHPC recipe and packages.
>
> One master node and 4 compute nodes.
>
> NodeName=c[1-4] Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 State=UNKNOWN
>
> With simple test scripts, sbatch produces the following error when 
> running across more than one node –
>
> The batch script is –
>
> #!/bin/bash
> srun hostname
>
> $ sbatch -N4 -n4 hostname.sh
>
> Out file --
>
> c1
> srun: error: Task launch for 151.0 failed on node c4: Socket timed out on send/recv operation
> srun: error: Task launch for 151.0 failed on node c3: Socket timed out on send/recv operation
> srun: error: Task launch for 151.0 failed on node c2: Socket timed out on send/recv operation
> srun: error: Application launch failed: Socket timed out on send/recv operation
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
>
> Searching on this turns up a lot of information about large jobs 
> starting many tasks very quickly, along with recommendations for 
> timeout and large-cluster settings. BUT I’m running four tasks that are 
> just ‘hostname’!
>
> AND if I just execute srun from the command line, it works across the nodes:
>
> $ srun -N4 -n4 hostname
> c1
> c2
> c3
> c4
>
> Also, if I sbatch 20 tasks with a maximum of one node, it launches them 
> fine. But 21 tasks (which tries to launch on two nodes) works on the c1 
> node (with 20 lines of output) and fails for the 21st task on c2 –
>
> c1
> c1
> c1
> ... (17 more)
>
> srun: error: Task launch for 156.0 failed on node c2: Socket timed out on send/recv operation
> srun: error: Application launch failed: Socket timed out on send/recv operation
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
>
> Maybe I completely don’t get sbatch options/params (I’m using 
> defaults). BUT I’m attempting the simplest thing I could think of just 
> to test this out.
>
> Trying another approach to test: a script that uses a job array and 
> runs 32 copies of a simple Python script (so there is no srun in the 
> batch script) appears to work properly and uses all the nodes. But 
> sbatching a script that contains srun gives the errors.
>
> Really hoping this is something obvious that as a noob to OpenHPC and 
> Slurm I’m getting wrong.
>
> Thanks in advance for any pointers or answers!
>
> Ken
>

-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de
