I'm running a small-ish Slurm grid, 87 nodes with various hardware. On a few occasions lately, users submitting jobs have gotten an orted error and the job has failed. Try again a few hours later or the next day and the same job runs just fine.
Google-fu indicated it might be a DNS issue, where for whatever reason a node can't resolve the addresses of the other nodes in the job. So I populated /etc/hosts on each node with a complete listing of all the nodes so there wouldn't be any reliance on DNS. That very afternoon another job failed with an orted error, so at least in my case DNS doesn't seem to be the issue.
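To double-check that every node name really does resolve on a given node, something like the sketch below could be run there. The function name `report_unresolved` is made up for illustration. It reads node names on stdin and prints the ones that `getent hosts` cannot resolve (`getent` consults /etc/hosts and DNS according to nsswitch.conf).

```shell
# Hypothetical helper: print every name from stdin that fails to resolve.
report_unresolved() {
    while IFS= read -r node; do
        # Skip blank lines
        [ -n "$node" ] || continue
        # getent exits non-zero when the name cannot be resolved
        if ! getent hosts "$node" >/dev/null 2>&1; then
            printf '%s\n' "$node"
        fi
    done
}

# Possible usage on a Slurm node (assumes sinfo is on the PATH):
# sinfo -N -h -o '%N' | sort -u | report_unresolved
```

No output means every name resolved on that node.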
What's the best way to troubleshoot this when orted fails but doesn't give any sort of error to indicate what the root cause of the failure might be? And I also can't predictably induce the failure, just have to wait until it randomly chokes.
You can try increasing the Open MPI verbosity, both in general and for specific modules. That's often how I manage to see what's going wrong under the hood with Open MPI. Use the `ompi_info` command to list all the "verbose" parameters:
$ ompi_info --level 9 --all | grep _verbose
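Once you have picked parameters from that list, one way to turn them on is through environment variables in the job script. This is a minimal sketch: the two parameter names are what I would expect from the Open MPI 4.x series and may differ in your build, so confirm them against the `ompi_info` output first. The level 10 is an arbitrary choice and `./my_app` is a placeholder.

```shell
# Open MPI reads MCA parameters from environment variables named
# OMPI_MCA_<parameter>, so these can go at the top of a batch script.
# Assumption: both parameters exist in your Open MPI build.
export OMPI_MCA_plm_base_verbose=10   # framework that launches the orted daemons
export OMPI_MCA_oob_base_verbose=10   # out-of-band messaging between mpirun and orted

# Then launch as usual; ./my_app is a hypothetical application name.
# mpirun ./my_app
```

The same settings can be passed on the command line as `mpirun --mca plm_base_verbose 10 ...`. Since the failure is intermittent, leaving the variables in the affected users' job scripts means the extra output is already there the next time orted chokes.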