<div dir="ltr"><div>I came here looking for exactly this! When I last tried it in early December 2017, it was still "broken" with Slurm 17.11.0. Glad to see that it was fixed in 17.11.1 (and to know why). PAM limits are now being applied correctly on my cluster. Thanks for the link, Andy.<br></div><div><br></div><div>Cheers,<br></div></div><br><div class="gmail_quote"><div dir="ltr">On Fri, Dec 8, 2017 at 10:25 PM Andy Riebs &lt;<a href="mailto:andy.riebs@hpe.com">andy.riebs@hpe.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF" text="#000000">
    <p>Answering my own question, I got private email which points to
      <a class="m_7314606177470746809moz-txt-link-rfc2396E" href="https://bugs.schedmd.com/show_bug.cgi?id=4412" target="_blank">&lt;https://bugs.schedmd.com/show_bug.cgi?id=4412&gt;</a>, describing
      both the problem and the solution. (Thanks Matthieu!)<br>
    </p></div><div bgcolor="#FFFFFF" text="#000000">
    <p>Andy<br>
    </p></div><div bgcolor="#FFFFFF" text="#000000">
    <br>
    <div class="m_7314606177470746809moz-cite-prefix">On 12/08/2017 11:06 AM, Andy Riebs
      wrote:<br>
    </div>
    <blockquote type="cite">
      
      <p>I've gathered more information, and I am probably fighting
        with PAM. Notably, the problem can be reproduced with a
        single-node, single-task job, such as:</p>
      <p><tt>$ sbatch -N1 --reservation awr </tt><tt><br>
        </tt><tt>#!/bin/bash</tt><tt><br>
        </tt><tt>hostname</tt><tt><br>
        </tt><tt>Submitted batch job 90436</tt><tt><br>
        </tt><tt>$ sinfo -R</tt><tt><br>
        </tt><tt>batch job complete f slurm     2017-12-08T15:34:37
          node017</tt><tt><br>
        </tt><tt>$</tt><br>
      </p>
      <p>With SlurmdDebug=debug5, the only interesting thing in
        slurmd.log is</p>
      <p><tt>[2017-12-08T15:34:37.770] [90436.batch] Munge cryptographic
          signature plugin loaded</tt><tt><br>
        </tt><tt>[2017-12-08T15:34:37.778] [90436.batch] error:
          pam_open_session: Cannot make/remove an entry for the
          specified session</tt><tt><br>
        </tt><tt>[2017-12-08T15:34:37.779] [90436.batch] error: error in
          pam_setup</tt><tt><br>
        </tt><tt>[2017-12-08T15:34:37.804] [90436.batch] error:
          job_manager exiting abnormally, rc = 4020</tt><tt><br>
        </tt><tt>[2017-12-08T15:34:37.804] [90436.batch] job 90436
          completed with slurm_rc = 4020, job_rc = 0</tt><br>
      </p>
      <p>/etc/pam.d/slurm is defined as</p>
      <p><tt>auth            required        pam_localuser.so</tt><tt><br>
        </tt><tt>auth            required        pam_shells.so</tt><tt><br>
        </tt><tt>account         required        pam_unix.so</tt><tt><br>
        </tt><tt>account         required        pam_access.so</tt><tt><br>
        </tt><tt>session         required        pam_unix.so</tt><tt><br>
        </tt><tt>session         required        pam_loginuid.so</tt><br>
      </p>
      <p>/var/log/secure reports</p>
      <p><tt>Dec  8 15:34:37 node017 : pam_unix(slurm:session):
          open_session - error recovering username</tt><tt><br>
        </tt><tt>Dec  8 15:34:37 node017 : pam_loginuid(slurm:session):
          unexpected response from failed conversation function</tt><tt><br>
        </tt><tt>Dec  8 15:34:37 node017 : pam_loginuid(slurm:session):
          error recovering login user-name</tt><br>
      </p>
      <p>The message "error recovering username" seems likely to be at
        the heart of the problem here. This worked just fine with Slurm
        16.05.8, and I think it was also working with Slurm
        17.11.0-0pre2.</p>
      <p>Any thoughts about where I should go from here?</p>
      <p>Andy</p>
      On 11/30/2017 08:40 AM, Andy Riebs wrote:<br>
      <blockquote type="cite"> <font face="Helvetica, Arial, sans-serif">This afternoon we installed
          Slurm 17.11.0 on our 100+ node x86_64 cluster running CentOS 7.4,
          and we periodically see a single node (perhaps
          the first node in an allocation?) get drained with the reason
          "batch job complete failure".</font><br>
        <br>
        <font face="Helvetica, Arial, sans-serif">On one node in
          question, slurmd.log reports</font><br>
        
        <div class="m_7314606177470746809WordSection1">
          <blockquote><font face="Helvetica, Arial, sans-serif"><span style="font-size:10pt;color:black">pam_unix(slurm:session):
                open_session - error recovering username</span><span style="font-size:10pt;color:black"> <br>
                pam_loginuid(slurm:session): unexpected response from
                failed conversation function </span></font></blockquote>
        </div>
        <font face="Helvetica, Arial, sans-serif">On another node
          drained for the same reason,</font><br>
        <blockquote><font face="Helvetica, Arial, sans-serif">error:
            pam_open_session: Cannot make/remove an entry for the
            specified session<br>
            error: error in pam_setup<br>
            error: job_manager exiting abnormally, rc = 4020<br>
            sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0<br>
          </font></blockquote>
        <font face="Helvetica, Arial, sans-serif">slurmctld has logged</font><br>
        
        <div class="m_7314606177470746809WordSection1">
          <blockquote><font face="Helvetica, Arial, sans-serif"><span style="font-size:10pt;color:black"> error: slurmd
                error running JobId=33 on node(s)=node048: Slurmd could
                not execve job </span></font><br>
            <br>
            <font face="Helvetica, Arial, sans-serif"><span style="font-size:10pt;color:black">drain_nodes: node
                node048 state set to DRAIN</span></font></blockquote>
          <font face="Helvetica, Arial, sans-serif"><span style="font-size:10pt;color:black">If anyone can shed
              some light on where I should start looking, I shall be
              most obliged!</span></font><br>
          <br>
          <span><font face="Helvetica,
              Arial, sans-serif">Andy</font><br>
          </span><span> </span><br>
        </div>
        <pre class="m_7314606177470746809moz-signature" cols="72">-- 
Andy Riebs
<a class="m_7314606177470746809moz-txt-link-abbreviated" href="mailto:andy.riebs@hpe.com" target="_blank">andy.riebs@hpe.com</a>
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
<a href="tel:(404)%20648-9024" value="+14046489024" target="_blank">+1 404 648 9024</a>
My opinions are not necessarily those of HPE
    May the source be with you!
</pre>
      </blockquote>
      <br>
    </blockquote>
    <br>
  </div></blockquote></div>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><p dir="ltr">Alan Orth<br>
<a href="mailto:alan.orth@gmail.com">alan.orth@gmail.com</a><br>
<a href="https://picturingjordan.com">https://picturingjordan.com</a><br>
<a href="https://englishbulgaria.net">https://englishbulgaria.net</a><br>
<a href="https://mjanja.ch">https://mjanja.ch</a></p>
</div>