What's the best way to troubleshoot this when orted fails without giving any error that points to the root cause? I also can't reliably reproduce the failure; I just have to wait until it randomly chokes.
You can try increasing Open MPI's verbosity, both globally and for specific modules. That's often how I spot what's going wrong under the hood with Open MPI. Use the `ompi_info` command to list all the "verbose" parameters:
$ ompi_info --level 9 --all | grep _verbose
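Once you have picked the parameters you care about, a rough sketch of turning them on could look like the following. It assumes Open MPI 4.x or earlier (where the daemon is orted), and the framework names `plm`, `odls` and `oob` are only my guess at what is relevant to a daemon failure; confirm them against the `ompi_info` output above. The application name `./my_app` and the rank count are placeholders. Since the failure is random, every run is logged to its own timestamped file so the verbose output is there when it finally happens.

```shell
#!/bin/sh
# Assumption: Open MPI 4.x or earlier. Any MCA parameter can also be set
# through an environment variable named OMPI_MCA_<parameter>.
export OMPI_MCA_plm_base_verbose=10    # process launch framework (starts the daemons)
export OMPI_MCA_odls_base_verbose=10   # daemon-side launch of the local processes
export OMPI_MCA_oob_base_verbose=10    # out-of-band messaging between mpirun and daemons

# Placeholders: replace with your own binary and process count.
APP=./my_app
NP=4

# One log file per run, so the failing run can be found afterwards.
LOG="orted_debug_$(date +%Y%m%d_%H%M%S).log"

# Only launch when mpirun and the application are actually present.
if command -v mpirun >/dev/null 2>&1 && [ -x "$APP" ]; then
    # --debug-daemons enables daemon debugging output;
    # --leave-session-attached keeps the daemons attached so their
    # error messages reach this terminal instead of being lost.
    mpirun --debug-daemons --leave-session-attached -np "$NP" "$APP" 2>&1 | tee "$LOG"
fi
```

The same parameters can be passed per run as `--mca plm_base_verbose 10` on the `mpirun` command line instead of exporting them. Verbosity at this level is noisy, so it may be worth starting with one framework and adding more only if the log doesn't show the failure.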