What's the best way to troubleshoot this when orted fails without giving any error that would point to the root cause?  I also can't reliably reproduce the failure; I just have to wait until it randomly chokes.


You can try increasing Open MPI's verbosity, both in general and for specific modules.  That's often how I manage to notice what's going wrong under the hood with Open MPI.  Use the `ompi_info` command to list all the "verbose" parameters:


$ ompi_info --level 9 --all | grep _verbose
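
Once you have the parameter names, one way to turn them on is through `OMPI_MCA_<name>` environment variables (or the equivalent `--mca <name> <value>` on the `mpirun` command line). The following is only a sketch: I'm assuming that `plm_base_verbose` and `odls_base_verbose` are the relevant ones for an orted launch problem and that they show up in your `ompi_info` output, the level of 10 is an arbitrary "fairly chatty" choice, `--debug-daemons` may behave differently depending on your Open MPI version, and `./my_app` and `orted_debug.log` are placeholder names.

```shell
# Assumed framework parameters; confirm each one appears in the ompi_info output.
export OMPI_MCA_plm_base_verbose=10    # process launch framework (the one that starts orted)
export OMPI_MCA_odls_base_verbose=10   # daemon-side launching of the local processes

# Placeholder application and log file names. Skipped if mpirun is not in PATH.
if command -v mpirun >/dev/null 2>&1; then
    # Keep stdout and stderr together so the daemon messages stay in order.
    mpirun --debug-daemons -np 4 ./my_app 2>&1 | tee orted_debug.log
fi
```

Since the failure is intermittent, keeping the log of every run means the verbose output will already be there when it finally chokes.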