[slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"
Andy Riebs
andy.riebs at hpe.com
Fri Dec 8 13:19:44 MST 2017
Answering my own question: I received a private email pointing to
<https://bugs.schedmd.com/show_bug.cgi?id=4412>, which describes both the
problem and the solution. (Thanks, Matthieu!)
Andy
On 12/08/2017 11:06 AM, Andy Riebs wrote:
>
> I've gathered more information, and I am probably having a fight with
> PAM. Notably, this problem can be reproduced with a single-node,
> single-task job, such as
>
> $ sbatch -N1 --reservation awr
> #!/bin/bash
> hostname
> Submitted batch job 90436
> $ sinfo -R
> batch job complete f slurm 2017-12-08T15:34:37 node017
> $
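>
> (Side note: sinfo -R truncates the reason string by default, which is
> why it shows up as "batch job complete f" above. The full reason can
> be seen with something like the following, using the node name from
> this run, or by widening sinfo's reason field, e.g.
> sinfo -R -o "%50E %9u %19H %N":
>
> $ scontrol show node node017 | grep -i reason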
>
> With SlurmdDebug=debug5, the only interesting thing in slurmd.log is
>
> [2017-12-08T15:34:37.770] [90436.batch] Munge cryptographic signature
> plugin loaded
> [2017-12-08T15:34:37.778] [90436.batch] error: pam_open_session:
> Cannot make/remove an entry for the specified session
> [2017-12-08T15:34:37.779] [90436.batch] error: error in pam_setup
> [2017-12-08T15:34:37.804] [90436.batch] error: job_manager exiting
> abnormally, rc = 4020
> [2017-12-08T15:34:37.804] [90436.batch] job 90436 completed with
> slurm_rc = 4020, job_rc = 0
>
> /etc/pam.d/slurm is defined as
>
> auth required pam_localuser.so
> auth required pam_shells.so
> account required pam_unix.so
> account required pam_access.so
> session required pam_unix.so
> session required pam_loginuid.so
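>
> (This stack is presumably being invoked via UsePAM=1 in slurm.conf. If
> pamtester happens to be available on the compute node, the stack can be
> exercised outside of Slurm with something like
>
> # pamtester -v slurm someuser open_session close_session
>
> where "someuser" is a placeholder for a regular account; if that
> succeeds, it suggests the stack itself is fine and the problem is in
> how slurmd is calling it.)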
>
> /var/log/secure reports
>
> Dec 8 15:34:37 node017 : pam_unix(slurm:session): open_session -
> error recovering username
> Dec 8 15:34:37 node017 : pam_loginuid(slurm:session): unexpected
> response from failed conversation function
> Dec 8 15:34:37 node017 : pam_loginuid(slurm:session): error
> recovering login user-name
>
> The message "error recovering username" seems likely to be at the
> heart of the problem here. This worked just fine with Slurm 16.05.8,
> and I think it was also working with Slurm 17.11.0-0pre2.
>
> Any thoughts about where I should go from here?
>
> Andy
>
> On 11/30/2017 08:40 AM, Andy Riebs wrote:
>> We installed 17.11.0 on our 100+ node x86_64 cluster running
>> CentOS 7.4 this afternoon, and we periodically see a single node
>> (perhaps the first node in an allocation?) get drained with the
>> message "batch job complete failure".
>>
>> On one node in question, slurmd.log reports
>>
>> pam_unix(slurm:session): open_session - error recovering username
>> pam_loginuid(slurm:session): unexpected response from failed
>> conversation function
>>
>> On another node drained for the same reason,
>>
>> error: pam_open_session: Cannot make/remove an entry for the
>> specified session
>> error: error in pam_setup
>> error: job_manager exiting abnormally, rc = 4020
>> sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0
>>
>> slurmctld has logged
>>
>> error: slurmd error running JobId=33 on node(s)=node048: Slurmd
>> could not execve job
>>
>> drain_nodes: node node048 state set to DRAIN
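>>
>> (In the meantime, a drained node can be returned to service with
>> something like
>>
>> # scontrol update NodeName=node048 State=RESUME
>>
>> though it will presumably get drained again the next time the failure
>> occurs.)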
>>
>> If anyone can shine some light on where I should start looking, I
>> shall be most obliged!
>>
>> Andy
>>
>> --
>> Andy Riebs
>> andy.riebs at hpe.com
>> Hewlett-Packard Enterprise
>> High Performance Computing Software Engineering
>> +1 404 648 9024
>> My opinions are not necessarily those of HPE
>> May the source be with you!
>