[slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

Thu Nov 30 10:38:24 MST 2017

Hi,

You should look at this bug: https://bugs.schedmd.com/show_bug.cgi?id=4412

I thought it would be resolved in 17.11.0.

Regards
Matthieu

On 30 Nov 2017 00:56, "Andy Riebs" <andy.riebs at hpe.com> wrote:

> We've just installed 17.11.0 on our 100+ node x86_64 cluster running
> CentOS 7.4 this afternoon, and periodically see a single node (perhaps the
> first node in an allocation?) get drained with the message "batch job
> complete failure".
>
> On one node in question, slurmd.log reports
>
> pam_unix(slurm:session): open_session - error recovering username
> pam_loginuid(slurm:session): unexpected response from failed conversation function
>
> On another node drained for the same reason,
>
> error: pam_open_session: Cannot make/remove an entry for the specified session
> error: error in pam_setup
> error: job_manager exiting abnormally, rc = 4020
> sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0
>
> slurmctld has logged
>
> error: slurmd error running JobId=33 on node(s)=node048: Slurmd could not execve job
>
> drain_nodes: node Summer0c048 state set to DRAIN
>
> It's been a long day (for other reasons), so I'll go dig into this
> tomorrow. But if anyone can shine some light on where I should start
> looking, I shall be most obliged!
>
> Andy
>
> --
> Andy Riebs
> andy.riebs at hpe.com
> Hewlett-Packard Enterprise
> High Performance Computing Software Engineering
> +1 404 648 9024
> My opinions are not necessarily those of HPE
>     May the source be with you!
>
>