[slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"
Andy Riebs
andy.riebs at hpe.com
Fri Dec 8 13:19:44 MST 2017
Answering my own question: I received a private email pointing to
<https://bugs.schedmd.com/show_bug.cgi?id=4412>, which describes both the
problem and the solution. (Thanks, Matthieu!)
Andy
On 12/08/2017 11:06 AM, Andy Riebs wrote:
>
> I've gathered more information, and I am probably having a fight with
> PAM. Notably, this problem can be reproduced with a single-node,
> single-task job, such as
>
> $ sbatch -N1 --reservation awr
> #!/bin/bash
> hostname
> Submitted batch job 90436
> $ sinfo -R
> batch job complete f slurm 2017-12-08T15:34:37 node017
> $
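>
> (Side note: sinfo -R truncates the reason string by default, which is
> why it shows up as "batch job complete f" above. The full reason can
> be seen with something like the following, using the node name from
> this run, or by widening sinfo's reason field, e.g.
> sinfo -R -o "%50E %9u %19H %N":
>
> $ scontrol show node node017 | grep -i reason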
>
> With SlurmdDebug=debug5, the only interesting thing in slurmd.log is
>
> [2017-12-08T15:34:37.770] [90436.batch] Munge cryptographic signature
> plugin loaded
> [2017-12-08T15:34:37.778] [90436.batch] error: pam_open_session:
> Cannot make/remove an entry for the specified session
> [2017-12-08T15:34:37.779] [90436.batch] error: error in pam_setup
> [2017-12-08T15:34:37.804] [90436.batch] error: job_manager exiting
> abnormally, rc = 4020
> [2017-12-08T15:34:37.804] [90436.batch] job 90436 completed with
> slurm_rc = 4020, job_rc = 0
>
> /etc/pam.d/slurm is defined as
>
> auth required pam_localuser.so
> auth required pam_shells.so
> account required pam_unix.so
> account required pam_access.so
> session required pam_unix.so
> session required pam_loginuid.so
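>
> (This stack is presumably being invoked via UsePAM=1 in slurm.conf. If
> pamtester happens to be available on the compute node, the stack can be
> exercised outside of Slurm with something like
>
> # pamtester -v slurm someuser open_session close_session
>
> where "someuser" is a placeholder for a regular account; if that
> succeeds, it suggests the stack itself is fine and the problem is in
> how slurmd is calling it.)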
>
> /var/log/secure reports
>
> Dec 8 15:34:37 node017 : pam_unix(slurm:session): open_session -
> error recovering username
> Dec 8 15:34:37 node017 : pam_loginuid(slurm:session): unexpected
> response from failed conversation function
> Dec 8 15:34:37 node017 : pam_loginuid(slurm:session): error
> recovering login user-name
>
> The message "error recovering username" seems likely to be at the
> heart of the problem here. This worked just fine with Slurm 16.05.8,
> and I think it was also working with Slurm 17.11.0-0pre2.
>
> Any thoughts about where I should go from here?
>
> Andy
>
> On 11/30/2017 08:40 AM, Andy Riebs wrote:
>> We installed 17.11.0 on our 100+ node x86_64 cluster running
>> CentOS 7.4 this afternoon, and we periodically see a single node
>> (perhaps the first node in an allocation?) get drained with the
>> message "batch job complete failure".
>>
>> On one node in question, slurmd.log reports
>>
>> pam_unix(slurm:session): open_session - error recovering username
>> pam_loginuid(slurm:session): unexpected response from failed
>> conversation function
>>
>> On another node drained for the same reason,
>>
>> error: pam_open_session: Cannot make/remove an entry for the
>> specified session
>> error: error in pam_setup
>> error: job_manager exiting abnormally, rc = 4020
>> sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0
>>
>> slurmctld has logged
>>
>> error: slurmd error running JobId=33 on node(s)=node048: Slurmd
>> could not execve job
>>
>> drain_nodes: node node048 state set to DRAIN
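>>
>> (In the meantime, a drained node can be returned to service with
>> something like
>>
>> # scontrol update NodeName=node048 State=RESUME
>>
>> though it will presumably get drained again the next time the failure
>> occurs.)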
>>
>> If anyone can shine some light on where I should start looking, I
>> shall be most obliged!
>>
>> Andy
>>
>> --
>> Andy Riebs
>> andy.riebs at hpe.com
>> Hewlett-Packard Enterprise
>> High Performance Computing Software Engineering
>> +1 404 648 9024
>> My opinions are not necessarily those of HPE
>> May the source be with you!
>