[slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"
Andy Riebs
andy.riebs at hpe.com
Wed Nov 29 16:53:29 MST 2017
We installed Slurm 17.11.0 on our 100+ node x86_64 cluster running
CentOS 7.4 this afternoon, and we periodically see a single node (perhaps
the first node in an allocation?) get drained with the message "batch
job complete failure".
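
For reference, here is how I've been spotting the affected nodes
(standard Slurm client commands; <nodename> is just a placeholder):

sinfo -R                        # list drained/down nodes with their Reason strings
scontrol show node <nodename>   # full detail for one node, including State and Reason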
On one node in question, slurmd.log reports
pam_unix(slurm:session): open_session - error recovering username
pam_loginuid(slurm:session): unexpected response from failed conversation function
On another node drained for the same reason,
error: pam_open_session: Cannot make/remove an entry for the specified session
error: error in pam_setup
error: job_manager exiting abnormally, rc = 4020
sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0
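
Since the messages on the first node are tagged "slurm:session", I assume
slurmd is using the PAM service name "slurm", so the stack being evaluated
should live in /etc/pam.d/slurm on the compute nodes. That's where I plan to
start comparing configurations tomorrow:

cat /etc/pam.d/slurm                # the session stack the "slurm" PAM service should be reading
grep -r pam_loginuid /etc/pam.d/    # see where pam_loginuid.so is pulled in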
slurmctld has logged
error: slurmd error running JobId=33 on node(s)=node048: Slurmd could not execve job
drain_nodes: node Summer0c048 state set to DRAIN
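
Once the underlying problem is sorted out, I expect the drained nodes can be
returned to service with something like this (node name taken from the
slurmctld message above; adjust as needed):

scontrol update NodeName=node048 State=RESUME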
It's been a long day (for other reasons), so I'll dig into this
tomorrow. But if anyone can shed some light on where I should start
looking, I shall be most obliged!
Andy
--
Andy Riebs
andy.riebs at hpe.com
Hewlett Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!