<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p>I've gathered more information, and it looks like I am fighting
with PAM. First, of note, this problem can be reproduced with a
single-node, single-task job, such as</p>
<p><tt>$ sbatch -N1 --reservation awr </tt><tt><br>
</tt><tt>#!/bin/bash</tt><tt><br>
</tt><tt>hostname</tt><tt><br>
</tt><tt>Submitted batch job 90436</tt><tt><br>
</tt><tt>$ sinfo -R</tt><tt><br>
</tt><tt>batch job complete f slurm 2017-12-08T15:34:37
node017</tt><tt><br>
</tt><tt>$</tt><br>
</p>
<p>With SlurmdDebug=debug5, the only interesting entries in
slurmd.log are</p>
<p><tt>[2017-12-08T15:34:37.770] [90436.batch] Munge cryptographic
signature plugin loaded</tt><tt><br>
</tt><tt>[2017-12-08T15:34:37.778] [90436.batch] error:
pam_open_session: Cannot make/remove an entry for the specified
session</tt><tt><br>
</tt><tt>[2017-12-08T15:34:37.779] [90436.batch] error: error in
pam_setup</tt><tt><br>
</tt><tt>[2017-12-08T15:34:37.804] [90436.batch] error:
job_manager exiting abnormally, rc = 4020</tt><tt><br>
</tt><tt>[2017-12-08T15:34:37.804] [90436.batch] job 90436
completed with slurm_rc = 4020, job_rc = 0</tt><br>
</p>
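<p>For checking whether other nodes show the same pattern, here is a
minimal sketch that groups the "error:" lines in slurmd.log by job
step. It is not part of Slurm: the helper name errors_by_step and the
regular expression are my own assumptions, based only on the line
format in the excerpt above.</p>
<pre>
```python
import re
from collections import defaultdict

# Assumed slurmd.log line shape, inferred from the excerpt above:
#   [timestamp] [jobid.step] message
LINE_RE = re.compile(r"^\[([^\]]+)\] \[([^\]]+)\] (.*)$")


def errors_by_step(lines):
    """Group 'error:' messages by job step (hypothetical helper)."""
    grouped = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not in the assumed format, skip it
        _timestamp, step, message = match.groups()
        if message.startswith("error:"):
            grouped[step].append(message)
    return dict(grouped)


# Sample data: the slurmd.log lines quoted above, rejoined onto one line each.
SAMPLE = [
    "[2017-12-08T15:34:37.770] [90436.batch] Munge cryptographic signature plugin loaded",
    "[2017-12-08T15:34:37.778] [90436.batch] error: pam_open_session: "
    "Cannot make/remove an entry for the specified session",
    "[2017-12-08T15:34:37.779] [90436.batch] error: error in pam_setup",
    "[2017-12-08T15:34:37.804] [90436.batch] error: job_manager exiting abnormally, rc = 4020",
]

if __name__ == "__main__":
    for step, messages in errors_by_step(SAMPLE).items():
        print(step)
        for message in messages:
            print("   ", message)
```
</pre>
<p>To run it against a real log, pass an open file object in place of
SAMPLE; the path to slurmd.log depends on SlurmdLogFile on your
system.</p>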
<p>/etc/pam.d/slurm is defined as</p>
<p><tt>auth required pam_localuser.so</tt><tt><br>
</tt><tt>auth required pam_shells.so</tt><tt><br>
</tt><tt>account required pam_unix.so</tt><tt><br>
</tt><tt>account required pam_access.so</tt><tt><br>
</tt><tt>session required pam_unix.so</tt><tt><br>
</tt><tt>session required pam_loginuid.so</tt><br>
</p>
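<p>To make it easier to see which modules run in the session phase
(the phase where pam_open_session fails), here is a small sketch that
splits the stack above into its fields. The helper names are
hypothetical, and it only handles the simple "type control module"
form shown here, not bracketed controls, "-" prefixes, or include
directives.</p>
<pre>
```python
# The /etc/pam.d/slurm contents quoted above.
SLURM_PAM = """\
auth required pam_localuser.so
auth required pam_shells.so
account required pam_unix.so
account required pam_access.so
session required pam_unix.so
session required pam_loginuid.so
"""


def parse_pam_stack(text):
    """Return (type, control, module, args) tuples for simple pam.d lines."""
    entries = []
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop comments, split on whitespace
        try:
            kind, control, module, *args = fields
        except ValueError:
            continue  # blank line, or fewer than three fields
        entries.append((kind, control, module, args))
    return entries


def modules_for(entries, pam_type):
    """List module names for one management group, in stack order."""
    return [module for kind, _control, module, _args in entries if kind == pam_type]


if __name__ == "__main__":
    stack = parse_pam_stack(SLURM_PAM)
    print("session modules:", modules_for(stack, "session"))
```
</pre>
<p>The two session modules it lists, pam_unix.so and pam_loginuid.so,
are the same two that log errors in /var/log/secure.</p>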
<p>/var/log/secure reports</p>
<p><tt>Dec 8 15:34:37 node017 : pam_unix(slurm:session):
open_session - error recovering username</tt><tt><br>
</tt><tt>Dec 8 15:34:37 node017 : pam_loginuid(slurm:session):
unexpected response from failed conversation function</tt><tt><br>
</tt><tt>Dec 8 15:34:37 node017 : pam_loginuid(slurm:session):
error recovering login user-name</tt><br>
</p>
<p>The message "error recovering username" seems likely to be at the
heart of the problem here. This configuration worked just fine with
Slurm 16.05.8, and I think it was also working with Slurm
17.11.0-0pre2.</p>
<p>Any thoughts about where I should go from here?</p>
<p>Andy</p>
On 11/30/2017 08:40 AM, Andy Riebs wrote:<br>
<blockquote type="cite"
cite="mid:7ddd5996-f70a-a901-edbe-ef154b39e054@hpe.com">
<font face="Helvetica, Arial, sans-serif">We've just installed
17.11.0 on our 100+ node x86_64 cluster running CentOS 7.4 this
afternoon, and periodically see a single node (perhaps the first
node in an allocation?) get drained with the message "batch job
complete failure".</font><br>
<br>
<font face="Helvetica, Arial, sans-serif">On one node in question,
slurmd.log reports</font><br>
<div class="WordSection1">
<blockquote><font face="Helvetica, Arial, sans-serif"><span
style="font-size: 10pt; color: black;">pam_unix(slurm:session):
open_session - error recovering username</span><span
style="font-size: 10pt; color: black;"> <br>
pam_loginuid(slurm:session): unexpected response from
failed conversation function </span></font></blockquote>
</div>
<font face="Helvetica, Arial, sans-serif">On another node drained
for the same reason,</font><br>
<blockquote><font face="Helvetica, Arial, sans-serif">error:
pam_open_session: Cannot make/remove an entry for the
specified session<br>
error: error in pam_setup<br>
error: job_manager exiting abnormally, rc = 4020<br>
sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0<br>
</font></blockquote>
<font face="Helvetica, Arial, sans-serif">slurmctld has logged</font><br>
<div class="WordSection1">
<blockquote><font face="Helvetica, Arial, sans-serif"><span
style="font-size: 10pt; color: black;"> error: slurmd
error running JobId=33 on node(s)=node048: Slurmd could
not execve job </span></font><br>
<br>
<font face="Helvetica, Arial, sans-serif"><span
style="font-size: 10pt; color: black;">drain_nodes: node
node048 state set to DRAIN</span></font></blockquote>
<font face="Helvetica, Arial, sans-serif"><span
style="font-size: 10pt; color: black;">If anyone can shine
some light on where I should start looking, I shall be most
obliged!</span></font><br>
<br>
<span style="font-size:10.0pt;font-family:'Segoe UI',sans-serif; color:black"><font face="Helvetica, Arial, sans-serif">Andy</font><br>
</span><br>
</div>
<pre class="moz-signature" cols="72">--
Andy Riebs
<a class="moz-txt-link-abbreviated" href="mailto:andy.riebs@hpe.com" moz-do-not-send="true">andy.riebs@hpe.com</a>
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!
</pre>
</blockquote>
<br>
</body>
</html>