[slurm-users] Correct way in sbatch/srun to switch primary UNIX group.
Viviano, Brad
Viviano.Brad at epa.gov
Wed Jul 17 11:50:23 UTC 2019
Our site is in the process of upgrading Slurm on our primary cluster, which was delivered to us with Slurm 16.05 with Bright Computing. We're currently at 17.02.13-2 and working to get to 17.11 and then 18.08. We've run into an issue with 17.11 when switching the effective GID for an sbatch/srun job. I've found only one mention of this issue in the archive, with no specific resolution:
https://groups.google.com/forum/#!topic/slurm-users/YZlTqBoMZ0o
Our site has a "many projects -> single user" mapping, so a given user is likely to be in 3+ projects, which map to corresponding Slurm accounts in sacctmgr. For each Slurm account, we create a corresponding POSIX/UNIX group of the same name and set up a directory on our GPFS storage owned by that group, with a disk quota.
We made the switch to Slurm from Torque a few years back. In Torque we were using the "-W group_list=" option to let users change their effective GID to one of their auxiliary groups on a per-job basis. In 17.02 and earlier, we've been using the --gid= option to similar effect, allowing users to switch their effective GID for a given job to the auxiliary group that matches the project they are burning time for.
On 17.02 the following works fine:
[login1] $ id
uid=1000(user1) gid=1000(users) groups=1000(users),1001(test)
[login1] $ srun --account=general --pty /bin/bash -i
[compute1] $ id --group
1000
[login1] $ srun --account=test --gid=test --pty /bin/bash -i
[compute1] $ id --group
1001
On 17.11, using --gid gets an error:
[login1] $ srun --account=test --gid=test --pty /bin/bash -i
srun: error: --gid only permitted by root user
The only workaround I've found that mimics the same behavior is to use "newgrp" or "sg" on the login node to make the auxiliary group the effective group at submit time:
[login1] $ sg test 'srun --account=test --pty /bin/bash -i'
[compute1] $ id --group
1001
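As a minimal sketch of how this workaround could be packaged for users, assuming (as at our site) that every Slurm account has a UNIX group of the same name. The function names in_group and project_srun are hypothetical helpers, not part of Slurm, and the wrapper does not handle arguments that contain whitespace or quotes:

```shell
#!/bin/sh
# in_group GROUP [GROUP_LIST]
# Succeeds if GROUP appears in GROUP_LIST, a space-separated list that
# defaults to the caller's own groups as reported by `id -Gn`.
in_group() {
    wanted=$1
    group_list=${2:-$(id -Gn)}
    for g in $group_list; do
        if [ "$g" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}

# project_srun ACCOUNT [SRUN_ARGS...]
# Hypothetical wrapper: runs srun under `sg` so that the job is submitted
# with ACCOUNT's same-named UNIX group as the effective GID.
project_srun() {
    if [ "$#" -lt 1 ]; then
        echo "usage: project_srun ACCOUNT [SRUN_ARGS...]" >&2
        return 2
    fi
    account=$1
    shift
    if ! in_group "$account"; then
        echo "project_srun: you are not a member of group '$account'" >&2
        return 1
    fi
    # sg takes the command as a single string, so arguments containing
    # whitespace or quotes would need extra escaping here.
    sg "$account" "srun --account=$account $*"
}
```

With these functions sourced into the login shell, "project_srun test --pty /bin/bash -i" would run the same command as the sg example above, after first checking group membership on the login node.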
I've reviewed the slurm-users archive, bug notes, etc., and understand why the change was made to disallow --uid/--gid except for root. What I am looking for is information or suggestions on the best way to mimic the 17.02 and earlier functionality in a secure way.
I've already attempted to write a JobSubmit plugin that, in the "extern int job_submit" function, overwrites job_desc->group_id with an alternate group ID based on job_desc->account. That resulted in an error on the slurmd side:
[2019-07-16T13:26:20.073] error: job 95 credential created for gid 1001, expected 1000
[2019-07-16T13:26:20.073] error: Invalid job credential from 1000 at 172.20.2.2: Invalid job credential
I then attempted a SPANK plugin that adds my own option, "--egid", to srun/sbatch and tries to overwrite the GID that Slurm picked up from the login node. I was able to bring in "--egid=test" and resolve the group name to a GID number, but no matter which slurm_spank_* function I tried, the effect of the "setegid" or "setgid" system calls didn't hold. Once the actual slurmstepd process started, the effective GID in the user process was 1000 (the GID in effect when sbatch/srun was run) instead of 1001.
I'm hoping there is something I missed, either native to Slurm or in the JobSubmit/SPANK plugin interfaces, that would let users switch their effective GID on a per-job basis to any of the groups they belong to.
Thanks,
-Brad Viviano
===================================================
Brad Viviano
Senior Systems Engineer