Thanks for the suggestion Ole - I tried this out yesterday with RHEL 9.4 with two slightly different setups.
1) Using the stock ice driver that comes with RHEL 9.4 for the card, I still saw the issue.
2) There was not a pre-built version of the ice driver on the Intel download site, so I built it myself, rebooted and re-ran the test. It greatly reduced the number of occurrences of the issue - but didn't eliminate them.
This is similar to what I saw on the RHEL 9.3 setup (adding the intel ICE driver reduced occurrences but did not eliminate them entirely).
I can also report that the 23.02.7 tree had similar results on the 9.3 node setup. Going backwards on the Slurm bits did not seem to change the number of occurrences.
Unfortunately I think I'm out of time for experiments on these nodes, but maybe this thread will be useful to others down the road.
Brent
PS - sorry for my last post getting tagged as a new issue. Hopefully this one will thread correctly.