<div dir="ltr"><div>I came here looking for this! When I last tried it in early December 2017, it was still "broken" with SLURM 17.11.0. Glad to see that it was fixed in 17.11.1 (and to know why). PAM limits are now being applied correctly on my cluster. Thanks for the link, Andy.<br></div><div><br></div><div>Cheers,<br></div></div><br><div class="gmail_quote"><div dir="ltr">On Fri, Dec 8, 2017 at 10:25 PM Andy Riebs &lt;<a href="mailto:andy.riebs@hpe.com">andy.riebs@hpe.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
    <p>Answering my own question: I received a private email pointing to
      <a class="m_7314606177470746809moz-txt-link-rfc2396E" href="https://bugs.schedmd.com/show_bug.cgi?id=4412" target="_blank">&lt;https://bugs.schedmd.com/show_bug.cgi?id=4412&gt;</a>, which describes
      both the problem and the solution. (Thanks, Matthieu!)<br>
</p></div><div bgcolor="#FFFFFF" text="#000000">
<p>Andy<br>
</p></div><div bgcolor="#FFFFFF" text="#000000">
<br>
<div class="m_7314606177470746809moz-cite-prefix">On 12/08/2017 11:06 AM, Andy Riebs
wrote:<br>
</div>
<blockquote type="cite">
      <p>I've gathered more information, and I am probably fighting
        with PAM. Notably, this problem can be reproduced
        with a single-node, single-task job, such as</p>
<p><tt>$ sbatch -N1 --reservation awr </tt><tt><br>
</tt><tt>#!/bin/bash</tt><tt><br>
</tt><tt>hostname</tt><tt><br>
</tt><tt>Submitted batch job 90436</tt><tt><br>
</tt><tt>$ sinfo -R</tt><tt><br>
</tt><tt>batch job complete f slurm 2017-12-08T15:34:37
node017</tt><tt><br>
</tt><tt>$</tt><br>
</p>
      <p>With SlurmdDebug=debug5, the only interesting entries in
        slurmd.log are</p>
<p><tt>[2017-12-08T15:34:37.770] [90436.batch] Munge cryptographic
signature plugin loaded</tt><tt><br>
</tt><tt>[2017-12-08T15:34:37.778] [90436.batch] error:
pam_open_session: Cannot make/remove an entry for the
specified session</tt><tt><br>
</tt><tt>[2017-12-08T15:34:37.779] [90436.batch] error: error in
pam_setup</tt><tt><br>
</tt><tt>[2017-12-08T15:34:37.804] [90436.batch] error:
job_manager exiting abnormally, rc = 4020</tt><tt><br>
</tt><tt>[2017-12-08T15:34:37.804] [90436.batch] job 90436
completed with slurm_rc = 4020, job_rc = 0</tt><br>
</p>
<p>/etc/pam.d/slurm is defined as</p>
<p><tt>auth required pam_localuser.so</tt><tt><br>
</tt><tt>auth required pam_shells.so</tt><tt><br>
</tt><tt>account required pam_unix.so</tt><tt><br>
</tt><tt>account required pam_access.so</tt><tt><br>
</tt><tt>session required pam_unix.so</tt><tt><br>
</tt><tt>session required pam_loginuid.so</tt><br>
</p>
<p>/var/log/secure reports</p>
<p><tt>Dec 8 15:34:37 node017 : pam_unix(slurm:session):
open_session - error recovering username</tt><tt><br>
</tt><tt>Dec 8 15:34:37 node017 : pam_loginuid(slurm:session):
unexpected response from failed conversation function</tt><tt><br>
</tt><tt>Dec 8 15:34:37 node017 : pam_loginuid(slurm:session):
error recovering login user-name</tt><br>
</p>
<p>The message "error recovering username" seems likely to be at
the heart of the problem here. This worked just fine with Slurm
16.05.8, and I think it was also working with Slurm
17.11.0-0pre2.</p>
<p>Any thoughts about where I should go from here?</p>
<p>Andy</p>
On 11/30/2017 08:40 AM, Andy Riebs wrote:<br>
      <blockquote type="cite"> <font face="Helvetica, Arial, sans-serif">We installed
          17.11.0 this afternoon on our 100+ node x86_64 cluster running
          CentOS 7.4, and we periodically see a single node (perhaps
          the first node in an allocation?) get drained with the message
          "batch job complete failure".</font><br>
<br>
<font face="Helvetica, Arial, sans-serif">On one node in
question, slurmd.log reports</font><br>
<font face="Helvetica, Arial, sans-serif"> </font>
<div class="m_7314606177470746809WordSection1">
<blockquote><font face="Helvetica, Arial, sans-serif"><span style="font-size:10pt;color:black">pam_unix(slurm:session):
open_session - error recovering username</span><span style="font-size:10pt;color:black"> <br>
pam_loginuid(slurm:session): unexpected response from
failed conversation function </span></font></blockquote>
</div>
<font face="Helvetica, Arial, sans-serif">On another node
drained for the same reason,</font><br>
<blockquote><font face="Helvetica, Arial, sans-serif">error:
pam_open_session: Cannot make/remove an entry for the
specified session<br>
error: error in pam_setup<br>
error: job_manager exiting abnormally, rc = 4020<br>
sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0<br>
</font></blockquote>
<font face="Helvetica, Arial, sans-serif">slurmctld has logged</font><br>
<font face="Helvetica, Arial, sans-serif"> </font>
<div class="m_7314606177470746809WordSection1"><font face="Helvetica, Arial,
sans-serif"><span style="font-size:10pt;color:black"></span></font>
<blockquote><font face="Helvetica, Arial, sans-serif"><span style="font-size:10pt;color:black"> error: slurmd
error running JobId=33 on node(s)=node048: Slurmd could
not execve job </span></font><br>
<br>
<font face="Helvetica, Arial, sans-serif"><span style="font-size:10pt;color:black">drain_nodes: node
node048 state set to DRAIN</span></font></blockquote>
          <font face="Helvetica, Arial, sans-serif"><span style="font-size:10pt;color:black">If anyone can shed
              some light on where I should start looking, I shall be
              most obliged!</span></font><br>
<br>
<span><font face="Helvetica,
Arial, sans-serif">Andy</font><br>
</span><span> </span><br>
</div>
<pre class="m_7314606177470746809moz-signature" cols="72">--
Andy Riebs
<a class="m_7314606177470746809moz-txt-link-abbreviated" href="mailto:andy.riebs@hpe.com" target="_blank">andy.riebs@hpe.com</a>
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
<a href="tel:(404)%20648-9024" value="+14046489024" target="_blank">+1 404 648 9024</a>
My opinions are not necessarily those of HPE
May the source be with you!
</pre>
</blockquote>
<br>
</blockquote>
<br>
</div></blockquote></div>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><p dir="ltr">Alan Orth<br>
<a href="mailto:alan.orth@gmail.com">alan.orth@gmail.com</a><br>
<a href="https://picturingjordan.com">https://picturingjordan.com</a><br>
<a href="https://englishbulgaria.net">https://englishbulgaria.net</a><br>
<a href="https://mjanja.ch">https://mjanja.ch</a></p>
</div>