[slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"
Andy Riebs
andy.riebs at hpe.com
Thu Nov 30 06:40:34 MST 2017
We installed Slurm 17.11.0 this afternoon on our 100+ node x86_64 cluster
running CentOS 7.4, and we periodically see a single node (perhaps the
first node in an allocation?) get drained with the message "batch job
complete failure".
On one of the affected nodes, slurmd.log reports
pam_unix(slurm:session): open_session - error recovering username
pam_loginuid(slurm:session): unexpected response from failed
conversation function
On another node drained for the same reason,
error: pam_open_session: Cannot make/remove an entry for the
specified session
error: error in pam_setup
error: job_manager exiting abnormally, rc = 4020
sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0
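For reference, the "(slurm:session)" tag in those messages indicates the
"slurm" PAM service, i.e. the stack in /etc/pam.d/slurm on the compute
node, which slurmd only consults when UsePAM=1 is set in slurm.conf. A
stack of that sort typically looks something like this (illustrative
only, not copied from our nodes):

  # /etc/pam.d/slurm -- illustrative example only
  account   required   pam_unix.so
  session   required   pam_limits.so
  session   required   pam_unix.so
  session   required   pam_loginuid.so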
slurmctld has logged
error: slurmd error running JobId=33 on node(s)=node048: Slurmd
could not execve job
drain_nodes: node Summer0c048 state set to DRAIN
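For what it's worth, putting a node drained this way back in service is
just the usual (using node048 from the log above as the example):

  scontrol update NodeName=node048 State=RESUME
  sinfo -R    # lists nodes still drained/down and the recorded reason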
If anyone can shine some light on where I should start looking, I shall
be most obliged!
Andy
--
Andy Riebs
andy.riebs at hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!