We have some new AMD EPYC compute nodes with 96 cores/node running Rocky Linux 8.9. We've had a number of incidents where the Munge log file /var/log/munge/munged.log suddenly starts growing until it fills up the root file system to 100% (tens of GBs), and the node eventually comes to a grinding halt! Wiping munged.log and restarting the node works around the issue.
I've tried to track down the symptoms and this is what I found:
1. In munged.log the following line is repeated endlessly, filling up the disk:
2024-04-11 09:59:29 +0200 Info: Suspended new connections while processing backlog
2. The slurmd is not getting any responses from munged, even though we run "munged --num-threads 10". The slurmd.log displays errors like:
[2024-04-12T02:05:45.001] error: If munged is up, restart with --num-threads=10
[2024-04-12T02:05:45.001] error: Munge encode failed: Failed to connect to "/var/run/munge/munge.socket.2": Resource temporarily unavailable
[2024-04-12T02:05:45.001] error: slurm_buffers_pack_msg: auth_g_create: RESPONSE_ACCT_GATHER_UPDATE has authentication error
3. The /var/log/messages file shows the errors from slurmd as well as NetworkManager saying "Too many open files in system". The telltale syslog entry seems to be:
Apr 12 02:05:48 e009 kernel: VFS: file-max limit 65536 reached
where the limit is confirmed in /proc/sys/fs/file-max.
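For reference, the kernel's current file-handle usage can be compared against the limit by reading /proc/sys/fs/file-nr, which prints three numbers: allocated handles, unused handles, and the maximum (the same value as fs.file-max):
cat /proc/sys/fs/file-nr
Watching the first number climb towards the limit while a job runs would confirm the diagnosis.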
We have never before seen any such errors from Munge. The errors may be triggered by certain user codes (possibly star-ccm+) that open far more files on the 96-core nodes than on nodes with a lower core count.
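To pin down the culprit, one could count open file descriptors per process from /proc on an affected node, for example with a rough one-liner like this (run as root; just a sketch, we haven't systematically collected this on the failing nodes yet):
for p in /proc/[0-9]*; do echo "$(ls $p/fd 2>/dev/null | wc -l) $(cat $p/comm 2>/dev/null)"; done | sort -rn | head
This lists the processes holding the most open file descriptors first.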
My workaround has been to set this line in /etc/sysctl.conf:
fs.file-max = 131072
and reload the settings with "sysctl -p". We haven't seen any of the Munge errors since!
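On EL8 the same setting could alternatively live in a drop-in file such as /etc/sysctl.d/90-file-max.conf (the file name is just an example) containing:
fs.file-max = 131072
and be applied with "sysctl --system", which reloads all sysctl configuration files and keeps /etc/sysctl.conf unmodified.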
The version of Munge in Rocky Linux 8.9 is 0.5.13, but there is a newer release at https://github.com/dun/munge/releases/tag/munge-0.5.16. I can't figure out whether 0.5.16 contains a fix for the issue seen here?
Questions: Have other sites seen this Munge issue as well? Are there any good recommendations for setting the fs.file-max parameter on Slurm compute nodes?
Thanks for sharing your insights, Ole