<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <p>To answer my own question: I received a private email pointing to
      <a class="moz-txt-link-rfc2396E" href="https://bugs.schedmd.com/show_bug.cgi?id=4412">&lt;https://bugs.schedmd.com/show_bug.cgi?id=4412&gt;</a>, which describes
      both the problem and the solution. (Thanks, Matthieu!)<br>
    </p>
    <p>Andy<br>
    </p>
    <br>
    <div class="moz-cite-prefix">On 12/08/2017 11:06 AM, Andy Riebs
      wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:04af12b4-0947-3fbc-e340-bfd476180035@hpe.com">
      <p>I've gathered more information, and I am probably fighting
        with PAM. Of note, the problem can be reproduced with a
        single-node, single-task job such as</p>
      <p><tt>$ sbatch -N1 --reservation awr </tt><tt><br>
        </tt><tt>#!/bin/bash</tt><tt><br>
        </tt><tt>hostname</tt><tt><br>
        </tt><tt>Submitted batch job 90436</tt><tt><br>
        </tt><tt>$ sinfo -R</tt><tt><br>
        </tt><tt>batch job complete f slurm     2017-12-08T15:34:37
          node017</tt><tt><br>
        </tt><tt>$</tt><br>
      </p>
      <p>With SlurmdDebug=debug5, the only interesting entries in
        slurmd.log are</p>
      <p><tt>[2017-12-08T15:34:37.770] [90436.batch] Munge cryptographic
          signature plugin loaded</tt><tt><br>
        </tt><tt>[2017-12-08T15:34:37.778] [90436.batch] error:
          pam_open_session: Cannot make/remove an entry for the
          specified session</tt><tt><br>
        </tt><tt>[2017-12-08T15:34:37.779] [90436.batch] error: error in
          pam_setup</tt><tt><br>
        </tt><tt>[2017-12-08T15:34:37.804] [90436.batch] error:
          job_manager exiting abnormally, rc = 4020</tt><tt><br>
        </tt><tt>[2017-12-08T15:34:37.804] [90436.batch] job 90436
          completed with slurm_rc = 4020, job_rc = 0</tt><br>
      </p>
      <p>/etc/pam.d/slurm is defined as</p>
      <p><tt>auth            required        pam_localuser.so</tt><tt><br>
        </tt><tt>auth            required        pam_shells.so</tt><tt><br>
        </tt><tt>account         required        pam_unix.so</tt><tt><br>
        </tt><tt>account         required        pam_access.so</tt><tt><br>
        </tt><tt>session         required        pam_unix.so</tt><tt><br>
        </tt><tt>session         required        pam_loginuid.so</tt><br>
      </p>
      <p>/var/log/secure reports</p>
      <p><tt>Dec  8 15:34:37 node017 : pam_unix(slurm:session):
          open_session - error recovering username</tt><tt><br>
        </tt><tt>Dec  8 15:34:37 node017 : pam_loginuid(slurm:session):
          unexpected response from failed conversation function</tt><tt><br>
        </tt><tt>Dec  8 15:34:37 node017 : pam_loginuid(slurm:session):
          error recovering login user-name</tt><br>
      </p>
      <p>The message "error recovering username" seems likely to be at
        the heart of the problem here. This worked just fine with Slurm
        16.05.8, and I think it was also working with Slurm
        17.11.0-0pre2.</p>
      <p>Any thoughts about where I should go from here?</p>
      <p>Andy</p>
      On 11/30/2017 08:40 AM, Andy Riebs wrote:<br>
      <blockquote type="cite"
        cite="mid:7ddd5996-f70a-a901-edbe-ef154b39e054@hpe.com"> <font
          face="Helvetica, Arial, sans-serif">We installed Slurm
          17.11.0 this afternoon on our 100+ node x86_64 cluster running
          CentOS 7.4, and we periodically see a single node (perhaps
          the first node in an allocation?) get drained with the message
          "batch job complete failure".</font><br>
        <br>
        <font face="Helvetica, Arial, sans-serif">On one node in
          question, slurmd.log reports</font><br>
        <div class="WordSection1">
          <blockquote><font face="Helvetica, Arial, sans-serif"><span
                style="font-size: 10pt; color: black;">pam_unix(slurm:session):
                open_session - error recovering username</span><span
                style="font-size: 10pt; color: black;"> <br>
                pam_loginuid(slurm:session): unexpected response from
                failed conversation function </span></font></blockquote>
        </div>
        <font face="Helvetica, Arial, sans-serif">On another node
          drained for the same reason,</font><br>
        <blockquote><font face="Helvetica, Arial, sans-serif">error:
            pam_open_session: Cannot make/remove an entry for the
            specified session<br>
            error: error in pam_setup<br>
            error: job_manager exiting abnormally, rc = 4020<br>
            sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0<br>
          </font></blockquote>
        <font face="Helvetica, Arial, sans-serif">slurmctld has logged</font><br>
        <div class="WordSection1">
          <blockquote><font face="Helvetica, Arial, sans-serif"><span
                style="font-size: 10pt; color: black;"> error: slurmd
                error running JobId=33 on node(s)=node048: Slurmd could
                not execve job </span></font><br>
            <br>
            <font face="Helvetica, Arial, sans-serif"><span
                style="font-size: 10pt; color: black;">drain_nodes: node
                node048 state set to DRAIN</span></font></blockquote>
          <font face="Helvetica, Arial, sans-serif"><span
              style="font-size: 10pt; color: black;">If anyone can shine
              some light on where I should start looking, I shall be
              most obliged!</span></font><br>
          <br>
          <span style="font-size:10.0pt;font-family:'Segoe UI',sans-serif;color:black"><font
              face="Helvetica, Arial, sans-serif">Andy</font><br>
          </span><br>
        </div>
        <pre class="moz-signature" cols="72">-- 
Andy Riebs
<a class="moz-txt-link-abbreviated" href="mailto:andy.riebs@hpe.com" moz-do-not-send="true">andy.riebs@hpe.com</a>
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
    May the source be with you!
</pre>
      </blockquote>
      <br>
    </blockquote>
    <br>
  </body>
</html>