<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

    <style type="text/css">body p { margin-bottom: 0cm; margin-top: 0pt; } </style>

  </head>

  <body bidimailui-detected-decoding-type="UTF-8" text="#000000"

    bgcolor="#FFFFFF">

    <p>I had similar problems in the past.</p>

    <p>The 2 most common issues were:</p>

    <p>1. Controller load: when slurmctld was under heavy use, it
      sometimes didn't respond in a timely manner, so requests exceeded
      the timeout limit.</p>

    <p>2. Topology, message forwarding, and aggregation.</p>
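    <p>For 1, a rough sketch of how one might sample "sdiag" from cron
      and flag a busy controller. It assumes "sdiag" prints
      "Name: number" lines with labels such as "Server thread count" and
      "Agent queue size" (check this against your Slurm version). The
      function names and thresholds are hypothetical, not part of
      Slurm.</p>
    <pre>
```python
import re


def parse_sdiag(text):
    """Collect 'Name: integer' lines from sdiag-style text into a dict."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([A-Za-z][A-Za-z ]*?):\s+(\d+)\s*$", line)
        if match:
            # Keep the first occurrence; later sections may reuse a label.
            stats.setdefault(match.group(1), int(match.group(2)))
    return stats


def looks_busy(stats, max_threads=64, max_queue=100):
    """Apply hypothetical thresholds; tune them for your own cluster."""
    threads = stats.get("Server thread count", 0)
    queue = stats.get("Agent queue size", 0)
    return threads >= max_threads or queue >= max_queue


# Illustrative sample only, not real output from any cluster.
SAMPLE = """\
Server thread count: 70
Agent queue size:    3
Jobs submitted: 1200
"""

sample_stats = parse_sdiag(SAMPLE)
```
</pre>
    <p>In a cron job the text could come from
      subprocess.run(["sdiag"], capture_output=True, text=True).stdout
      (Python 3.7 or later), with a timestamped line appended to a log
      whenever looks_busy() returns True.</p>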

    <p><br>

    </p>

    <p>For 2: it seems the nodes designated for forwarding are
      statically assigned based on the topology. I could be wrong, but
      that's what I observed: I would get the socket timeout error
      whenever those nodes had issues, even though other nodes in the
      same topology 'zone' were fine and could have been used
      instead.</p>
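    <p>To see which settings are in play for forwarding, something like
      the following could pull the relevant keys out of
      "scontrol show config". This is only a sketch: it assumes the
      "Key = Value" layout shown in the quoted output below, the helper
      names are made up, and RoutePlugin exists only in some Slurm
      versions, so it may be reported as not set.</p>
    <pre>
```python
# Settings that plausibly affect message forwarding and timeouts.
KEYS_OF_INTEREST = ("MessageTimeout", "TreeWidth", "TopologyPlugin", "RoutePlugin")


def parse_show_config(text):
    """Turn 'Key = Value' lines into a dict, ignoring everything else."""
    config = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip():
            config[key.strip()] = value.strip()
    return config


def forwarding_settings(config):
    """Return only the keys of interest, marking absent ones."""
    return {key: config.get(key, "(not set)") for key in KEYS_OF_INTEREST}


# Illustrative values only, not taken from any real cluster.
SAMPLE_CONFIG = """\
MessageTimeout          = 10 sec
SlurmctldDebug          = info
TopologyPlugin          = topology/tree
TreeWidth               = 50
"""

settings = forwarding_settings(parse_show_config(SAMPLE_CONFIG))
```
</pre>
    <p>Comparing these values before and after a change makes it
      easier to tell whether a timeout is related to the forwarding
      setup or to something else.</p>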

    <p><br>

    </p>

    <p>I think it took the debug3 log level to see this in the logs.</p>

    <p><br>

    </p>

    <p>HTH</p>

    <p>--Dani_L.<br>

    </p>

    <p><br>

    </p>

    <div class="moz-cite-prefix">On 6/11/19 5:27 PM, Steffen Grunewald

      wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:20190611142739.g3nuo4u3lt3v2nh3@eddie.aei.mpg.de">

      <pre class="moz-quote-pre" wrap="">On Tue, 2019-06-11 at 13:56:34 +0000, Marcelo Garcia wrote:

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">Hi 

Since mid-March 2019 we are having a strange problem with slurm. Sometimes, the command "sbatch" fails:

+ sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1

sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

I've seen such an error message from the underlying file system.

Is there anything special (e.g. non-NFS) in your setup that may have changed

in the past few months?

Just a shot in the dark, of course...

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">Ecflow runs preprocessing on the script which generates a second script that is submitted to slurm. In our case, the submission script is called "42.job1". 

The problem we have is that sometimes the "sbatch" command fails with the message above. We couldn't find any hint in the logs. Hardware and software logs are clean. I increased the debug level of slurm to 

# scontrol show config

(..._)

SlurmctldDebug          = info

But still no clue about what is happening. Maybe the next thing to try is to use "sdiag" to inspect the server. Another complication is that the problem is random, so should we put "sdiag" in a cronjob? Is there a better way to run "sdiag" periodically?

Thanks for your attention.

Best Regards

mg.

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

- S

</pre>

    </blockquote>

    <br>

  </body>

</html>