We have some new AMD EPYC compute nodes with 96 cores/node running Rocky Linux 8.9. We've had a number of incidents where the Munge log file /var/log/munge/munged.log suddenly starts growing until it fills up the root file system to 100% (tens of GBs), and the node eventually comes to a grinding halt! Wiping munged.log and restarting the node works around the issue.
I've tried to track down the symptoms and this is what I found:
1. In munged.log the following line is repeated endlessly, filling up the disk:
2024-04-11 09:59:29 +0200 Info: Suspended new connections while processing backlog
2. The slurmd is not getting any responses from munged, even though we run "munged --num-threads 10". The slurmd.log displays errors like:
[2024-04-12T02:05:45.001] error: If munged is up, restart with --num-threads=10
[2024-04-12T02:05:45.001] error: Munge encode failed: Failed to connect to "/var/run/munge/munge.socket.2": Resource temporarily unavailable
[2024-04-12T02:05:45.001] error: slurm_buffers_pack_msg: auth_g_create: RESPONSE_ACCT_GATHER_UPDATE has authentication error
3. The /var/log/messages file shows the errors from slurmd as well as NetworkManager saying "Too many open files in system". The telltale syslog entry seems to be:
Apr 12 02:05:48 e009 kernel: VFS: file-max limit 65536 reached
where the limit is confirmed in /proc/sys/fs/file-max.
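For reference, the kernel's current file-handle usage can be compared against the limit by reading /proc/sys/fs/file-nr, which prints three numbers: allocated handles, unused handles, and the maximum (the same value as fs.file-max):
cat /proc/sys/fs/file-nr
Watching the first number climb towards the limit while a job runs would confirm the diagnosis.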
We have never before seen any such errors from Munge. The errors may be triggered by certain user codes (possibly star-ccm+) that open far more files on the 96-core nodes than on nodes with a lower core count.
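To pin down the culprit, one could count open file descriptors per process from /proc on an affected node, for example with a rough one-liner like this (run as root; just a sketch, we haven't systematically collected this on the failing nodes yet):
for p in /proc/[0-9]*; do echo "$(ls $p/fd 2>/dev/null | wc -l) $(cat $p/comm 2>/dev/null)"; done | sort -rn | head
This lists the processes holding the most open file descriptors first.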
My workaround has been to set this line in /etc/sysctl.conf:
fs.file-max = 131072
and reload the settings with "sysctl -p". We haven't seen any of the Munge errors since!
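On EL8 the same setting could alternatively live in a drop-in file such as /etc/sysctl.d/90-file-max.conf (the file name is just an example) containing:
fs.file-max = 131072
and be applied with "sysctl --system", which reloads all sysctl configuration files and keeps /etc/sysctl.conf unmodified.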
The version of Munge in Rocky Linux 8.9 is 0.5.13, but there is a newer release at https://github.com/dun/munge/releases/tag/munge-0.5.16. I can't figure out whether 0.5.16 contains a fix for the issue seen here?
Questions: Have other sites seen this Munge issue as well? Are there any good recommendations for setting the fs.file-max parameter on Slurm compute nodes?
Thanks for sharing your insights, Ole