[slurm-users] Slurm 20.11.3, Suspended new connections while processing backlog filled /

Robert Kudyba rkudyba at fordham.edu
Wed Mar 10 16:09:48 UTC 2021


I see there is this exact issue:
https://githubmemory.com/repo/dun/munge/issues/94. We are on Slurm 20.11.3
on Bright Cluster 8.1 on CentOS 7.9.

I found hundreds of these errors in slurmctld.log:
error: slurm_accept_msg_conn: Too many open files in system
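
For reference, a quick way to see how close the system is to that limit is the kernel's open-handle counter; the first field is the number of allocated file handles and the last is the maximum:

cat /proc/sys/fs/file-nr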

Then in munged.log:
Suspended new connections while processing backlog

Also in slurmctld.log:
Mar 7 15:40:21 node003 nslcd[7941]: [18ed80] <group/member="root"> failed
to bind to LDAP server ldaps://ldapserver/: Can't contact LDAP server:
Connection timed out
Mar 7 15:40:21 node003 nslcd[7941]: [18ed80] <group/member="root"> no
available LDAP server found: Can't contact LDAP server: Connection timed out
Mar 7 15:40:30 node001 nslcd[8838]: [53fb78] <group/member="root">
connected to LDAP server ldaps://ldapserver/
Mar 7 15:40:30 node003 nslcd[7941]: [b82726] <group/member="root"> no
available LDAP server found: Server is unavailable: Broken pipe
Mar 7 15:40:30 node003 nslcd[7941]: [b82726] <group/member="root"> no
available LDAP server found: Server is unavailable: Broken pipe

So / was 100% full. Yes, we should have put /var on a separate partition.
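
In case it helps anyone, a rough sketch of the checks that would have caught this sooner (the paths are just examples, not necessarily where the growth was in our case):

df -h /
du -xsh /var/log/* | sort -h | tail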

As for the file descriptor setting, we have:
cat /proc/sys/fs/file-max
131072
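
If the limit itself also needs raising, here is a rough sketch of what that might look like, assuming a systemd-managed slurmctld; the values below are only placeholders, not recommendations:

# system-wide maximum, persisted in /etc/sysctl.d/
echo "fs.file-max = 262144" > /etc/sysctl.d/90-filemax.conf
sysctl -p /etc/sysctl.d/90-filemax.conf

# per-process limit for slurmctld via a systemd drop-in
mkdir -p /etc/systemd/system/slurmctld.service.d
cat > /etc/systemd/system/slurmctld.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65536
EOF
systemctl daemon-reload
systemctl restart slurmctld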

Is there a way to avoid this in the future?