After upgrading to version 23.11.3, we started getting slammed with the following log message from slurmctld:
"error: validate_group: Could not find group with gid <id>"
This affects a handful of groups and repeats constantly, drowning out just about everything else. Looking up the groups shows that they exist on the scheduler node, and on all the submission and compute nodes as well. As far as I can tell, Slurm should be able to locate the groups in question.
Jobs submitted by users in those groups go through just fine. They get scheduled, run, and clean up with no problem. I'm at a loss on where to look next.
Managed to narrow it down a little. Our group file is pretty large, and a handful of individual groups are also quite large, as shown below:
[root@batch1 ~]# wc /etc/group
6075 6075 349457 /etc/group
[root@batch1 ~]# grep 8xxx2 /etc/group | wc -c
56959
It looks like one of the recent changes (https://github.com/SchedMD/slurm/commit/e1b4cdba70f7f1b5ac5335c572d9c4c79e6e...) migrated the old uid check to the dedicated `gid_from_uid` function. However, an important side effect of that migration is that we lost this part of the old loop:
```
if (errno == ERANGE) {
	buflen *= 2;
	xrealloc(buf, buflen);
	continue;
}
```
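Without that retry, I think we're hitting a buffer limit: a group entry that doesn't fit in the lookup buffer gets reported as not found instead of being retried with a larger buffer. Trimming our groups down far enough gets us back to normal operation, but unfortunately that's not a tenable solution.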
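
For reference, here is a minimal sketch of the grow-and-retry pattern the old loop provided, written against plain POSIX `getgrgid_r()`. This is not the Slurm source: `lookup_group_retry` is a hypothetical helper name, it uses `malloc`/`realloc` rather than Slurm's `xrealloc`, and the 64-byte starting size is deliberately tiny so the retry path gets exercised. Note that POSIX `getgrgid_r()` returns the error number directly, so the sketch checks the return value rather than `errno`.

```c
#include <errno.h>
#include <grp.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not Slurm code): look up a group by gid, doubling
 * the scratch buffer whenever getgrgid_r() reports ERANGE.
 *
 * Returns 0 if the group was found, -1 if it was not found or on error.
 * On success, *buf_out owns the memory that the string fields of *grp
 * point into; the caller must free() it when done with *grp.
 */
static int lookup_group_retry(gid_t gid, struct group *grp, char **buf_out)
{
	size_t buflen = 64;	/* assumption: small on purpose to force retries */
	char *buf = malloc(buflen);
	struct group *result = NULL;

	if (!buf)
		return -1;

	for (;;) {
		int rc = getgrgid_r(gid, grp, buf, buflen, &result);

		if (rc == ERANGE) {
			/* Entry did not fit: double the buffer and try again. */
			char *tmp;

			buflen *= 2;
			tmp = realloc(buf, buflen);
			if (!tmp) {
				free(buf);
				return -1;
			}
			buf = tmp;
			continue;
		}
		if (rc == EINTR)
			continue;	/* interrupted: retry with the same buffer */
		if (rc != 0 || !result) {
			/* Real error, or no such group. */
			free(buf);
			return -1;
		}
		*buf_out = buf;
		return 0;
	}
}
```

With a loop like this, a 56959-byte group line only costs a few extra reallocations instead of surfacing as a "Could not find group" error.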