[slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"
    Andy Riebs 
    andy.riebs at hpe.com
       
    Fri Dec  8 09:06:47 MST 2017
    
    
  
I've gathered more information, and I am probably having a fight with
pam. First, of note, this problem can be reproduced with a single-node,
single-task job, with the batch script typed on sbatch's stdin:
$ sbatch -N1 --reservation awr
#!/bin/bash
hostname
Submitted batch job 90436
$ sinfo -R
batch job complete f slurm     2017-12-08T15:34:37 node017
$
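(In the meantime I've just been clearing the drain by hand between
test runs with something along the lines of

$ scontrol update NodeName=node017 State=RESUME

so the node itself seems healthy apart from this.)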
With SlurmdDebug=debug5, the only thing interesting in slurmd.log is
[2017-12-08T15:34:37.770] [90436.batch] Munge cryptographic signature 
plugin loaded
[2017-12-08T15:34:37.778] [90436.batch] error: pam_open_session: Cannot 
make/remove an entry for the specified session
[2017-12-08T15:34:37.779] [90436.batch] error: error in pam_setup
[2017-12-08T15:34:37.804] [90436.batch] error: job_manager exiting 
abnormally, rc = 4020
[2017-12-08T15:34:37.804] [90436.batch] job 90436 completed with 
slurm_rc = 4020, job_rc = 0
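(For context, and as far as I understand the code path: we run with PAM
enabled in slurm.conf, which is why slurmstepd calls pam_setup() /
pam_open_session() for the batch step at all --

# slurm.conf -- the relevant fragment on this cluster
UsePAM=1

-- so a PAM session failure here kills the batch script before it ever
starts, and the node gets drained.)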
/etc/pam.d/slurm is defined as
auth            required        pam_localuser.so
auth            required        pam_shells.so
account         required        pam_unix.so
account         required        pam_access.so
session         required        pam_unix.so
session         required        pam_loginuid.so
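(One quick sanity check, if pamtester happens to be available on the
node, is to drive the same stack by hand for an ordinary user:

$ pamtester -v slurm someuser open_session close_session

If that succeeds, it would point at how slurmstepd is driving PAM in
17.11 rather than at the stack itself. "someuser" is just a placeholder
for any regular account on the node.)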
/var/log/secure reports
Dec  8 15:34:37 node017 : pam_unix(slurm:session): open_session - error 
recovering username
Dec  8 15:34:37 node017 : pam_loginuid(slurm:session): unexpected 
response from failed conversation function
Dec  8 15:34:37 node017 : pam_loginuid(slurm:session): error recovering 
login user-name
The message "error recovering username" seems likely to be at the heart 
of the problem here. This worked just fine with Slurm 16.05.8, and I 
think it was also working with Slurm 17.11.0-0pre2.
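To convince myself that the stack itself is sane, I hacked up a rough
little test harness (sketch below -- my own throwaway code, not anything
from the Slurm tree) that opens a "slurm" PAM session the way a
non-interactive daemon would: the user name is handed to pam_start(),
and the conversation function refuses to talk.

/* pam_probe.c -- throwaway test, not Slurm code.
 * Build: gcc pam_probe.c -o pam_probe -lpam   (run as root on the node)
 */
#include <stdio.h>
#include <security/pam_appl.h>

/* A conversation function that refuses to converse, roughly what a
 * non-interactive daemon would provide. */
static int null_conv(int num_msg, const struct pam_message **msg,
                     struct pam_response **resp, void *appdata)
{
    (void)num_msg; (void)msg; (void)resp; (void)appdata;
    return PAM_CONV_ERR;
}

int main(int argc, char **argv)
{
    const char *user = (argc > 1) ? argv[1] : "nobody";
    struct pam_conv conv = { null_conv, NULL };
    pam_handle_t *pamh = NULL;

    /* Service "slurm" selects /etc/pam.d/slurm; passing the user here
     * sets PAM_USER, so the session modules never need the conversation. */
    int rc = pam_start("slurm", user, &conv, &pamh);
    if (rc != PAM_SUCCESS) {
        fprintf(stderr, "pam_start: %s\n", pam_strerror(NULL, rc));
        return 1;
    }
    rc = pam_open_session(pamh, 0);
    fprintf(stderr, "pam_open_session: %s\n", pam_strerror(pamh, rc));
    if (rc == PAM_SUCCESS)
        pam_close_session(pamh, 0);
    pam_end(pamh, rc);
    return (rc == PAM_SUCCESS) ? 0 : 1;
}

If that succeeds for a normal user, then the "error recovering
username" / "failed conversation function" pair in /var/log/secure
would suggest that PAM_USER simply isn't being set on the handle that
slurmstepd creates in 17.11.0, which would fit a change between 16.05
and 17.11 rather than anything in our PAM configuration.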
Any thoughts about where I should go from here?
Andy
On 11/30/2017 08:40 AM, Andy Riebs wrote:
> We've just installed 17.11.0 on our 100+ node x86_64 cluster running 
> CentOS 7.4 this afternoon, and periodically see a single node (perhaps 
> the first node in an allocation?) get drained with the message "batch 
> job complete failure".
>
> On one node in question, slurmd.log reports
>
>     pam_unix(slurm:session): open_session - error recovering username
>     pam_loginuid(slurm:session): unexpected response from failed
>     conversation function 
>
> On another node drained for the same reason,
>
>     error: pam_open_session: Cannot make/remove an entry for the
>     specified session
>     error: error in pam_setup
>     error: job_manager exiting abnormally, rc = 4020
>     sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0
>
> slurmctld has logged
>
>     error: slurmd error running JobId=33 on node(s)=node048: Slurmd
>     could not execve job
>
>     drain_nodes: node node048 state set to DRAIN
>
> If anyone can shine some light on where I should start looking, I 
> shall be most obliged!
>
> Andy
>
> -- 
> Andy Riebs
> andy.riebs at hpe.com
> Hewlett-Packard Enterprise
> High Performance Computing Software Engineering
> +1 404 648 9024
> My opinions are not necessarily those of HPE
>      May the source be with you!