<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <p>I've gathered more information, and I am probably fighting
      with PAM.  Notably, the problem can be reproduced with a
      single-node, single-task job, such as</p>
    <p><tt>$ sbatch -N1 --reservation awr </tt><tt><br>
      </tt><tt>#!/bin/bash</tt><tt><br>
      </tt><tt>hostname</tt><tt><br>
      </tt><tt>Submitted batch job 90436</tt><tt><br>
      </tt><tt>$ sinfo -R</tt><tt><br>
      </tt><tt>batch job complete f slurm     2017-12-08T15:34:37
        node017</tt><tt><br>
      </tt><tt>$</tt><br>
    </p>
    <p>With SlurmdDebug=debug5, the only interesting entries in
      slurmd.log are</p>
    <p><tt>[2017-12-08T15:34:37.770] [90436.batch] Munge cryptographic
        signature plugin loaded</tt><tt><br>
      </tt><tt>[2017-12-08T15:34:37.778] [90436.batch] error:
        pam_open_session: Cannot make/remove an entry for the specified
        session</tt><tt><br>
      </tt><tt>[2017-12-08T15:34:37.779] [90436.batch] error: error in
        pam_setup</tt><tt><br>
      </tt><tt>[2017-12-08T15:34:37.804] [90436.batch] error:
        job_manager exiting abnormally, rc = 4020</tt><tt><br>
      </tt><tt>[2017-12-08T15:34:37.804] [90436.batch] job 90436
        completed with slurm_rc = 4020, job_rc = 0</tt><br>
    </p>
    <p>/etc/pam.d/slurm is defined as</p>
    <p><tt>auth            required        pam_localuser.so</tt><tt><br>
      </tt><tt>auth            required        pam_shells.so</tt><tt><br>
      </tt><tt>account         required        pam_unix.so</tt><tt><br>
      </tt><tt>account         required        pam_access.so</tt><tt><br>
      </tt><tt>session         required        pam_unix.so</tt><tt><br>
      </tt><tt>session         required        pam_loginuid.so</tt><br>
    </p>
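    <p>To make the session stack in that file easier to inspect, here
      is a minimal Python sketch. It is not part of Slurm or PAM, and
      the helper name <tt>parse_pam_config</tt> is made up for
      illustration. It assumes simple one-word control values such as
      <tt>required</tt>; it does not handle bracketed controls and does
      not follow <tt>include</tt> directives.</p>

```python
# Hypothetical helper: split simple PAM config lines into
# (type, control, module) tuples. Assumes one-word control values.
def parse_pam_config(text):
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue  # skip lines that do not have all three fields
        pam_type, control, rest = fields
        module = rest.split()[0]  # ignore any module arguments
        entries.append((pam_type, control, module))
    return entries


# Contents of /etc/pam.d/slurm as quoted in this message.
PAM_SLURM = """\
auth            required        pam_localuser.so
auth            required        pam_shells.so
account         required        pam_unix.so
account         required        pam_access.so
session         required        pam_unix.so
session         required        pam_loginuid.so
"""

entries = parse_pam_config(PAM_SLURM)
session_modules = [module for pam_type, _, module in entries
                   if pam_type == "session"]
print(session_modules)
```

    <p>The two session entries it reports, pam_unix.so and
      pam_loginuid.so, are the modules named in the /var/log/secure
      messages.</p>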
    <p>/var/log/secure reports</p>
    <p><tt>Dec  8 15:34:37 node017 : pam_unix(slurm:session):
        open_session - error recovering username</tt><tt><br>
      </tt><tt>Dec  8 15:34:37 node017 : pam_loginuid(slurm:session):
        unexpected response from failed conversation function</tt><tt><br>
      </tt><tt>Dec  8 15:34:37 node017 : pam_loginuid(slurm:session):
        error recovering login user-name</tt><br>
    </p>
    <p>The message "error recovering username" seems likely to be at the
      heart of the problem here. This worked just fine with Slurm
      16.05.8, and I think it was also working with Slurm 17.11.0-0pre2.</p>
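    <p>To check whether other drained nodes show the same signature, a
      small Python sketch along these lines could tally the PAM messages
      for the slurm service per module. The function name and the
      regular expression are assumptions about the log format, based
      only on the three /var/log/secure lines quoted above.</p>

```python
import re
from collections import Counter

# Matches PAM messages tagged with the "slurm" service, for example
# "pam_unix(slurm:session): open_session - error recovering username".
PAM_LINE = re.compile(r"(pam_\w+)\(slurm:(\w+)\):")


def summarize_pam_messages(lines):
    """Count slurm-service PAM messages per (module, type) pair."""
    counts = Counter()
    for line in lines:
        match = PAM_LINE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# The three lines from /var/log/secure, each joined onto one line.
sample = [
    "Dec  8 15:34:37 node017 : pam_unix(slurm:session): "
    "open_session - error recovering username",
    "Dec  8 15:34:37 node017 : pam_loginuid(slurm:session): "
    "unexpected response from failed conversation function",
    "Dec  8 15:34:37 node017 : pam_loginuid(slurm:session): "
    "error recovering login user-name",
]

summary = summarize_pam_messages(sample)
for (module, pam_type), count in sorted(summary.items()):
    print(module, pam_type, count)
```

    <p>In practice the input would be the lines of /var/log/secure read
      on each affected node rather than the hard-coded sample.</p>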
    <p>Any thoughts about where I should go from here?</p>
    <p>Andy</p>
    On 11/30/2017 08:40 AM, Andy Riebs wrote:<br>
    <blockquote type="cite"
      cite="mid:7ddd5996-f70a-a901-edbe-ef154b39e054@hpe.com">
      <font face="Helvetica, Arial, sans-serif">We've just installed
        17.11.0 on our 100+ node x86_64 cluster running CentOS 7.4 this
        afternoon, and periodically see a single node (perhaps the first
        node in an allocation?) get drained with the message "batch job
        complete failure".</font><br>
      <br>
      <font face="Helvetica, Arial, sans-serif">On one node in question,
        slurmd.log reports</font><br>
      <div class="WordSection1">
        <blockquote><font face="Helvetica, Arial, sans-serif"><span
              style="font-size: 10pt; color: black;">pam_unix(slurm:session):
              open_session - error recovering username</span><span
              style="font-size: 10pt; color: black;"> <br>
              pam_loginuid(slurm:session): unexpected response from
              failed conversation function </span></font></blockquote>
      </div>
      <font face="Helvetica, Arial, sans-serif">On another node drained
        for the same reason,</font><br>
      <blockquote><font face="Helvetica, Arial, sans-serif">error:
          pam_open_session: Cannot make/remove an entry for the
          specified session<br>
          error: error in pam_setup<br>
          error: job_manager exiting abnormally, rc = 4020<br>
          sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0<br>
        </font></blockquote>
      <font face="Helvetica, Arial, sans-serif">slurmctld has logged</font><br>
      <div class="WordSection1">
        <blockquote><font face="Helvetica, Arial, sans-serif"><span
              style="font-size: 10pt; color: black;"> error: slurmd
              error running JobId=33 on node(s)=node048: Slurmd could
              not execve job </span></font><br>
          <br>
          <font face="Helvetica, Arial, sans-serif"><span
              style="font-size: 10pt; color: black;">drain_nodes: node
              node048 state set to DRAIN</span></font></blockquote>
        <font face="Helvetica, Arial, sans-serif"><span
            style="font-size: 10pt; color: black;">If anyone can shed
            some light on where I should start looking, I would be most
            obliged!</span></font><br>
        <br>
        <span style="font-size:10.0pt;font-family:'Segoe UI',sans-serif;color:black"><font
            face="Helvetica, Arial, sans-serif">Andy</font><br>
        </span><span style="font-size:10.0pt;font-family:'Segoe UI',sans-serif;color:black"> </span><br>
      </div>
      <pre class="moz-signature" cols="72">-- 
Andy Riebs
<a class="moz-txt-link-abbreviated" href="mailto:andy.riebs@hpe.com" moz-do-not-send="true">andy.riebs@hpe.com</a>
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
    May the source be with you!
</pre>
    </blockquote>
    <br>
  </body>
</html>