[slurm-users] Slurm missing non primary group memberships

Aravindh Sampathkumar aravindh at fastmail.com
Wed Nov 14 10:14:21 MST 2018


Sorry for the late response. 

The group-affiliation problem was inconsistent: some of the nodes showed
the odd behaviour, while other nodes displayed the group memberships
properly.
I tried setting LaunchParameters=send_gids as per Douglas Jacobsen's
suggestion by doing the following (a sketch of the commands is below):
1. Changing slurm.conf to add LaunchParameters=send_gids
2. Synchronising slurm.conf across the cluster
3. Restarting slurmctld
4. scontrol reconfigure
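
For anyone who wants the exact incantation, it is roughly this (the
slurm.conf path below is an assumption; adjust for your site):

    # /etc/slurm/slurm.conf on the controller (path is an assumption)
    LaunchParameters=send_gids

    # after pushing the updated slurm.conf to every node:
    systemctl restart slurmctld
    scontrol reconfigure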

So far, it appears to have made our problem go away. Users' group
affiliations are now reflected correctly, as expected.
Thanks,
--
  Aravindh Sampathkumar
  aravindh at fastmail.com



On Tue, Nov 13, 2018, at 10:21 AM, Antony Cleave wrote:
> Are you sure this isn't working as designed? 
> 
> I remember there is something annoying about groups in the manual.
> Here it is. This is why I prefer accounts:
> 
> *NOTE:* For performance reasons, Slurm maintains a list of user IDs
> allowed to use each partition and this is checked at job submission
> time. This list of user IDs is updated when the *slurmctld* daemon is
> restarted, reconfigured (e.g. "scontrol reconfig") or the partition's
> *AllowGroups* value is reset, even if its value is unchanged (e.g.
> "scontrol update PartitionName=name AllowGroups=group"). For a user's
> access to a partition to change, both his group membership must change
> and Slurm's internal user ID list must change using one of the methods
> described above.
> 
> Are you adding groups after submission too? Does changing AllowGroups
> on the partition fix it too?
> 
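> For example (the partition name "compute" is just a placeholder; the
> groups are the ones from the id output further down the thread),
> resetting AllowGroups and checking the result would look something like:
> 
>     # make slurmctld rebuild its allowed-user list for the partition
>     scontrol update PartitionName=compute AllowGroups=finland,nav,ghpc
>     scontrol show partition compute
> 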
> Antony
> 
> On Tue, 13 Nov 2018, 09:13 Joerg Sassmannshausen
> <joerg.sassmannshausen at crick.ac.uk> wrote:
>>  Dear all,
>> 
>>  I am wondering if that is the same issue we are having here as well.
>>  When I am adding users to a secondary group some time *after* the
>>  initial user installation, the user cannot access the Slurm partition
>>  they are supposed to. We found two remedies here, more or less by
>>  chance:
>>  - rebooting both the Slurm server and the Slurm DB server
>>  - being patient and waiting long enough
>> 
>>  Obviously, neither remedy is suitable if you are running a large
>>  research environment. The reboot only happened because we physically
>>  had to move the servers, and waiting long enough was simply because we
>>  did not have an answer to the question.
>>  As already mentioned in a different posting, we have deleted the user
>>  in Slurm and re-installed it, and updated sssd on the Slurm server,
>>  all in vain.
>> 
>>  However, reading the thread, the latter case points to a caching
>>  problem, similar to the one described here. We are also using FreeIPA
>>  and hence sssd for the ID lookup.
>> 
>>  Poking the list a bit further on this subject: does anybody have
>>  similar experiences when the lookup is done directly against AD? We
>>  are planning to move to AD, and if that is also an issue there we are
>>  at least warned.
>> 
>>  All the best
>> 
>>  Jörg
>> 
>>  On 10/11/18 11:17, Douglas Jacobsen wrote:
>>  > We've had issues getting sssd to work reliably on compute nodes (at
>>  > least at scale); the reason is not fully understood, but basically if
>>  > the connection times out, sssd will blacklist the server for 60s,
>>  > which then causes those kinds of issues.
>>  >
>>  > Setting LaunchParameters=send_gids will sidestep this issue by doing
>>  > the lookups exclusively on the controller node, where more frequent
>>  > connections can prevent time-decay disconnections and reduce the
>>  > likelihood of cache misses.
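>>  > As a quick sanity check on a suspect node, something like this shows
>>  > whether sssd is returning the secondary groups (user and group names
>>  > taken from the id output quoted below), and sss_cache from sssd-tools
>>  > can flush the local cache:
>>  >
>>  >     # look up the user and group via NSS (sssd) on the node
>>  >     id navtp
>>  >     getent group finland
>>  >     # flush sssd's local cache (as root) and retry the lookup
>>  >     sss_cache -E
>>  >     id navtp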
>>  >
>>  > On Fri, Nov 9, 2018 at 11:16 PM Chris Samuel <chris at csamuel.org> wrote:
>>  >
>>  >     On Friday, 9 November 2018 2:47:51 AM AEDT Aravindh
>>  >     Sampathkumar wrote:
>>  >
>>  >     > navtp at console2:~> ssh c07b07 id
>>  >     > uid=29865(navtp) gid=510(finland)
>>  >     > groups=510(finland),508(nav),5001(ghpc)
>>  >     > context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
>>  >
>>  >     Do you have SELinux configured by any chance?
>>  >
>>  >     If so you might want to check if it works with it disabled
>>  >     first..
>>  >
>>  >     All the best,
>>  >     Chris
>>  >     --
>>  >      Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
>>  >
>>  >
>>  >
>>  >
>>  > --
>>  > Sent from Gmail Mobile
>> 
>>  --
>>  Dr. Jörg Saßmannshausen, MRSC
>>  HPC & Research Data System Engineer
>>  Scientific Computing
>>  The Francis Crick Institute
1 Midland Road
>>  London, NW1 1AT
>>  email: joerg.sassmannshausen at crick.ac.uk
>>  phone: 020 379 65139