On 7/15/24 10:43, William VINCENT via slurm-users wrote:
I am writing to report an issue with the Slurmctld process on our RHEL 9 (Rocky Linux) system.
Twice in the past 5 days, the Slurmctld process has encountered an error that resulted in the service stopping. The error message displayed was "double free or corruption (out)". This error has caused significant disruption to our jobs, and we are concerned about its recurrence.
We have tried troubleshooting the issue, but we have not been able to identify the root cause of the problem. We would appreciate any assistance or guidance you can provide to help us resolve this issue.
Please let us know if you need any additional information or if there are any specific steps we should take to diagnose the problem further.
You're running Slurm 22.05.9 on Rocky Linux 9 (is that Rocky 9.4 or what?). Such an old Slurm version probably hasn't been tested much on EL9 systems.
For security reasons you ought to upgrade to a recent Slurm version. Just search for "CVE" in https://github.com/SchedMD/slurm/blob/master/NEWS to find out about security holes in older versions.
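A minimal sketch of that search, assuming you have the contents of the NEWS file as a string. The function name cve_lines and the sample text are made up for illustration, and the CVE identifier in the sample is a placeholder, not a real advisory:

```python
import re

def cve_lines(news_text: str) -> list[str]:
    """Return the stripped lines of NEWS-style text that mention a CVE identifier."""
    pattern = re.compile(r"CVE-\d{4}-\d+")
    return [line.strip() for line in news_text.splitlines() if pattern.search(line)]

# Hypothetical sample text, NOT taken from the real NEWS file.
sample = """\
* Changes in Slurm X.Y.Z
 -- Fix a scheduling corner case.
 -- Fix a security issue. CVE-0000-00000 (placeholder identifier).
"""

for line in cve_lines(sample):
    print(line)
```

For the real file you would read a local copy of NEWS and pass its text to the function.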
You can upgrade by 2 major releases in a single step, so you can go to 23.11.8. Upgrading Slurm is fairly easy, and I've collected various pieces of advice in the Wiki page https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slur...
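As a rough illustration of the "2 major releases in a single step" rule, here is a small sketch. The helper names and the hard-coded release list are assumptions for this example and are not part of Slurm itself; extend the list if you need other releases:

```python
# Slurm major release series in chronological order (assumed, partial list).
RELEASES = ["21.08", "22.05", "23.02", "23.11"]

def major(version: str) -> str:
    """Return the major release series, e.g. '22.05.9' -> '22.05'."""
    return ".".join(version.split(".")[:2])

def can_upgrade_directly(current: str, target: str, max_jump: int = 2) -> bool:
    """True if target is at most max_jump major releases newer than current."""
    cur = RELEASES.index(major(current))
    tgt = RELEASES.index(major(target))
    return 0 <= tgt - cur <= max_jump

print(can_upgrade_directly("22.05.9", "23.11.8"))  # True: 22.05 -> 23.02 -> 23.11
print(can_upgrade_directly("21.08.8", "23.11.8"))  # False: three releases apart
```

An installation further behind than that would need an intermediate upgrade first.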
Hopefully a newer Slurm version is going to solve your issue.
I hope this helps, Ole