On 5/7/24 15:32, Henderson, Brent via slurm-users wrote:
Over the past few days I grabbed some time on the nodes and ran for a few hours. Looks like I **can** still hit the issue with cgroups disabled. The incident rate was 8 out of >11k jobs, so it dropped by an order of magnitude or so. I'm guessing that exonerates cgroups as the cause, though they may just be a good way to tickle the real issue. Over the next few days, I’ll try to roll everything back to RHEL 8.9 and see how that goes.
My 2 cents: RHEL/AlmaLinux/RockyLinux 9.4 is out now, so maybe it's worth trying an update to 9.4?
/Ole