[slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

Wed Nov 29 16:53:29 MST 2017

We installed Slurm 17.11.0 this afternoon on our 100+ node x86_64
cluster running CentOS 7.4, and we periodically see a single node
(perhaps the first node in an allocation?) get drained with the reason
"batch job complete failure".

On one of the nodes in question, slurmd.log reports

    pam_unix(slurm:session): open_session - error recovering username
    pam_loginuid(slurm:session): unexpected response from failed
    conversation function 

On another node drained for the same reason, slurmd.log reports

    error: pam_open_session: Cannot make/remove an entry for the
    specified session
    error: error in pam_setup
    error: job_manager exiting abnormally, rc = 4020
    sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0

slurmctld has logged

    error: slurmd error running JobId=33 on node(s)=node048: Slurmd
    could not execve job

    drain_nodes: node Summer0c048 state set to DRAIN
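
For anyone who wants to check their own nodes for the same symptoms,
here is a rough sketch of a scan for the PAM-related lines quoted
above. The function name, the sample list, and the patterns are
illustrative assumptions built only from these excerpts; they are not
an exhaustive list of what slurmd can log.

```python
import re

# Patterns derived from the log excerpts in this message (assumed, not exhaustive).
PAM_PATTERNS = [
    re.compile(r"pam_\w+\(slurm:session\)"),
    re.compile(r"error: pam_open_session"),
    re.compile(r"error: error in pam_setup"),
    re.compile(r"job_manager exiting abnormally, rc = \d+"),
]


def find_pam_errors(lines):
    """Return (line_number, text) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if any(pattern.search(text) for pattern in PAM_PATTERNS):
            hits.append((number, text))
    return hits


# Sample input copied from the excerpts above, with wrapped lines rejoined.
SAMPLE = [
    "pam_unix(slurm:session): open_session - error recovering username",
    "error: pam_open_session: Cannot make/remove an entry for the specified session",
    "error: error in pam_setup",
    "error: job_manager exiting abnormally, rc = 4020",
    "sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0",
]

for number, text in find_pam_errors(SAMPLE):
    print(f"{number}: {text}")
```

To run it against a real log, pass an open file object instead of
SAMPLE; the log path depends on the SlurmdLogFile setting at your site.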

It's been a long day (for other reasons), so I'll dig into this
tomorrow. But if anyone can shed some light on where I should start
looking, I shall be most obliged!

Andy

-- 
Andy Riebs
andy.riebs at hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
     May the source be with you!
