[slurm-users] Slurm missing non primary group memberships

Janne Blomqvist janne.blomqvist at aalto.fi
Tue Nov 20 04:12:59 MST 2018


On 10/11/2018 13.17, Douglas Jacobsen wrote:
> We've had issues getting sssd to work reliably on compute nodes (at 
> least at scale), the reason is not fully understood, but basically if 
> the connection times out with sssd it'll black list the server for 60s, 
> which then causes those kinds of issues.

In our experience sssd doesn't work reliably in large environments if 
user/group enumeration is enabled (the "enumerate" config option).
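
For reference, a minimal sketch of the relevant sssd.conf fragment. The
domain name "example.org" is a placeholder, not taken from any real setup;
the option lives in the [domain/...] section of whichever domain is
configured.

```
[domain/example.org]
# "example.org" is a placeholder domain name.
# Do not serve full user/group listings; lookups by name or ID still work.
enumerate = false
```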

Slurm used to require enumeration, but in

https://github.com/SchedMD/slurm/commit/48a4cdf8d9433b5655a26581768200e7a696ce87

I reworked the logic so that enumeration should only be required in a few 
unusual special cases. That patch went in several years ago, so hopefully 
whatever bugs it caused have been ironed out by now (*knocking on wood*).
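
To illustrate why enumeration matters for group lookups, here is a rough
sketch in Python. It is not the Slurm code and does not reproduce the commit
above; the function names are made up for this example. It only contrasts
scanning the whole group database (which depends on enumeration being
available) with asking the name service for a single user's groups.

```python
import grp
import os
import pwd


def groups_by_enumeration(username):
    """Collect gids by scanning every group entry.

    This walks the whole group database, so with sssd and enumerate
    disabled it can miss groups that are not listed.
    """
    primary_gid = pwd.getpwnam(username).pw_gid
    gids = {primary_gid}
    for entry in grp.getgrall():
        if username in entry.gr_mem:
            gids.add(entry.gr_gid)
    return gids


def groups_by_lookup(username):
    """Collect gids with a per-user lookup; no full listing is needed."""
    primary_gid = pwd.getpwnam(username).pw_gid
    return set(os.getgrouplist(username, primary_gid))
```

On a machine using only local files the two functions would normally agree;
with a directory service and enumeration off, the first one may come up
short on non-primary groups.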

> Setting LaunchParameters=send_gids will sidestep this issue by doing the 
> lookups exclusively on the controller node, where more frequent 
> connections can prevent time decay disconnections and reduce the 
> likelihood of cache misses.

This is probably a good idea, particularly if one has large parallel jobs; 
otherwise the nodes could DoS the AD/LDAP servers at job launch if the 
cache is cold.
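
As a sketch, the setting Douglas mentions is a single line in slurm.conf
(the rest of the file is omitted here):

```
# slurm.conf
# Resolve group memberships on the controller and send them with the job.
LaunchParameters=send_gids
```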


-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqvist at aalto.fi


