Hi Bjørn-Helge,
On 4/16/24 12:08, Bjørn-Helge Mevik via slurm-users wrote:
> Ole Holm Nielsen via slurm-users <slurm-users@lists.schedmd.com> writes:
>
>> Therefore I believe that the root cause of the present issue is user
>> applications opening a lot of files on our 96-core nodes, and we need
>> to increase fs.file-max.
>
> You could also set a limit per user, for instance in
> /etc/security/limits.d/.  Then users would be blocked from opening
> unreasonably many files.  One could use this to find which applications
> are responsible, and try to get them fixed.
That sounds interesting, but which per-user limit would guard against exhausting the kernel's fs.file-max?  For example, a user already has a fairly narrow limit on open files:

  $ ulimit -n
  1024
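For what it's worth, the per-process "nofile" limit and the system-wide fs.file-max are separate knobs: many processes can collectively exhaust fs.file-max even when each one stays under nofile=1024.  A quick way to see how close a node is to the system-wide ceiling (standard /proc paths; the values will of course differ per node):

```shell
# Show the per-process limit next to the system-wide file table usage.
# /proc/sys/fs/file-nr prints three numbers: allocated file handles,
# unused-but-allocated handles, and the fs.file-max ceiling.
echo "per-process nofile limit: $(ulimit -n)"
read -r allocated _unused maximum < /proc/sys/fs/file-nr
echo "system-wide: ${allocated} allocated of ${maximum} (fs.file-max)"
```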
whereas the permitted number of user processes is a lot higher:
  $ ulimit -u
  3092846
I'm not sure how the number 3092846 got set, since it isn't defined in /etc/security/limits.conf.  Moreover, "ulimit -u" varies quite a bit among our compute nodes, so which dynamic service might be adjusting these limits?
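If it helps: when nothing in limits.conf sets nproc, the kernel's built-in default for RLIMIT_NPROC is, as far as I can tell, kernel.threads-max / 2, and threads-max is scaled from the amount of RAM at boot.  That would explain why "ulimit -u" differs between nodes with different memory sizes.  A quick check (standard Linux /proc paths):

```shell
# Compare the current nproc limit against the kernel's memory-derived
# default (threads-max / 2); if the two match, it suggests no
# PAM/limits.conf entry is overriding the default on this node.
echo "ulimit -u:          $(ulimit -u)"
threads_max=$(cat /proc/sys/kernel/threads-max)
echo "kernel.threads-max: ${threads_max}"
echo "threads-max / 2:    $((threads_max / 2))"
```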
Perhaps there is a recommendation for defining nproc in /etc/security/limits.conf on compute nodes?
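In case it's useful for the discussion, a per-user cap along the lines Bjørn-Helge suggests might look like the fragment below, e.g. in a file such as /etc/security/limits.d/90-nproc.conf (the file name and the numbers are only placeholders, not a recommendation):

```
# Hypothetical example values, to be tuned per site:
# cap open files and processes for ordinary users,
# but leave root unrestricted.
*       soft    nofile  8192
*       hard    nofile  65536
*       soft    nproc   4096
*       hard    nproc   8192
root    -       nproc   unlimited
```

Note that these limits are applied by pam_limits, so they only take effect for sessions that go through PAM (and, for Slurm jobs, whether they apply depends on how slurmd is configured with respect to PAM).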
Thanks,
Ole