[slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"
Andy Riebs
andy.riebs at hpe.com
Fri Dec 8 09:06:47 MST 2017
I've gathered more information, and I am probably having a fight with
PAM. First, of note, this problem can be reproduced with a single-node,
single-task job, such as
$ sbatch -N1 --reservation awr
#!/bin/bash
hostname
Submitted batch job 90436
$ sinfo -R
batch job complete f slurm 2017-12-08T15:34:37 node017
$
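As an aside, sinfo truncates the REASON column by default, so "batch job
complete f" above is the short form of "batch job complete failure" from the
subject line. A rough sketch of how to see the full reason, and to return the
node to service once the cause is fixed, using the node name from above:

$ sinfo -R -o "%50E %9u %19H %N"    # widen the reason field
$ scontrol update NodeName=node017 State=RESUME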
With SlurmdDebug=debug5, the only entries of interest in slurmd.log are
[2017-12-08T15:34:37.770] [90436.batch] Munge cryptographic signature plugin loaded
[2017-12-08T15:34:37.778] [90436.batch] error: pam_open_session: Cannot make/remove an entry for the specified session
[2017-12-08T15:34:37.779] [90436.batch] error: error in pam_setup
[2017-12-08T15:34:37.804] [90436.batch] error: job_manager exiting abnormally, rc = 4020
[2017-12-08T15:34:37.804] [90436.batch] job 90436 completed with slurm_rc = 4020, job_rc = 0
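As far as I know, slurmd only goes down the pam_setup/pam_open_session path
when UsePAM=1 is set in slurm.conf, so a quick sanity check is to confirm
what the running daemons think (the slurm.conf path below is just the common
default; adjust for your install):

$ scontrol show config | grep -i UsePAM
$ grep -i '^UsePAM' /etc/slurm/slurm.conf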
/etc/pam.d/slurm is defined as
auth required pam_localuser.so
auth required pam_shells.so
account required pam_unix.so
account required pam_access.so
session required pam_unix.so
session required pam_loginuid.so
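To take slurmd out of the picture, the same PAM stack can be exercised by
hand with pamtester, assuming it is installed on the node; run it as root to
mimic slurmd, and substitute a real user for "someuser":

# pamtester -v slurm someuser open_session
# pamtester -v slurm someuser close_session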
/var/log/secure reports
Dec 8 15:34:37 node017 : pam_unix(slurm:session): open_session - error recovering username
Dec 8 15:34:37 node017 : pam_loginuid(slurm:session): unexpected response from failed conversation function
Dec 8 15:34:37 node017 : pam_loginuid(slurm:session): error recovering login user-name
The message "error recovering username" seems likely to be at the heart
of the problem here. This worked just fine with Slurm 16.05.8, and I
think it was also working with Slurm 17.11.0-0pre2.
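If it would help to see the PAM calls in context, the failure can presumably
also be captured by stopping the regular slurmd on the affected node and
running it in the foreground with extra verbosity while resubmitting the test
job (sketch only; the service name assumes the usual systemd unit):

# systemctl stop slurmd     # or however the daemon is managed on the node
# slurmd -D -vvvv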
Any thoughts about where I should go from here?
Andy
On 11/30/2017 08:40 AM, Andy Riebs wrote:
> We've just installed 17.11.0 on our 100+ node x86_64 cluster running
> CentOS 7.4 this afternoon, and periodically see a single node (perhaps
> the first node in an allocation?) get drained with the message "batch
> job complete failure".
>
> On one node in question, slurmd.log reports
>
> pam_unix(slurm:session): open_session - error recovering username
> pam_loginuid(slurm:session): unexpected response from failed
> conversation function
>
> On another node drained for the same reason,
>
> error: pam_open_session: Cannot make/remove an entry for the
> specified session
> error: error in pam_setup
> error: job_manager exiting abnormally, rc = 4020
> sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0
>
> slurmctld has logged
>
> error: slurmd error running JobId=33 on node(s)=node048: Slurmd
> could not execve job
>
> drain_nodes: node node048 state set to DRAIN
>
> If anyone can shine some light on where I should start looking, I
> shall be most obliged!
>
> Andy
>
> --
> Andy Riebs
> andy.riebs at hpe.com
> Hewlett-Packard Enterprise
> High Performance Computing Software Engineering
> +1 404 648 9024
> My opinions are not necessarily those of HPE
> May the source be with you!