[slurm-users] cgroup limits not created for jobs
Paul Raines
raines at nmr.mgh.harvard.edu
Sun Jul 26 19:21:09 UTC 2020
On Sat, 25 Jul 2020 2:00am, Chris Samuel wrote:
> On Friday, 24 July 2020 9:48:35 AM PDT Paul Raines wrote:
>
>> But when I run a job, on the node it runs on I can find no
>> evidence in cgroups of any limits being set
>>
>> Example job:
>>
>> mlscgpu1[0]:~$ salloc -n1 -c3 -p batch --gres=gpu:quadro_rtx_6000:1 --mem=1G
>> salloc: Granted job allocation 17
>> mlscgpu1[0]:~$ echo $$
>> 137112
>> mlscgpu1[0]:~$
>
> You're not actually running inside a job at that point unless you've defined
> "SallocDefaultCommand" in your slurm.conf, and I'm guessing that's not the
> case there. You can make salloc fire up an srun for you in the allocation
> using that option; see the docs here:
>
> https://slurm.schedmd.com/slurm.conf.html#OPT_SallocDefaultCommand
>
Thank you so much. This also explains the missing CUDA_VISIBLE_DEVICES
problem from my previous post.
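If I understand this correctly, comparing the shell salloc hands back
with one started via srun inside the allocation should make the
difference visible. A rough sketch (the cgroup paths assume cgroup v1
and the task/cgroup plugin; adjust for your setup):

  salloc -n1 -c3 -p batch --gres=gpu:quadro_rtx_6000:1 --mem=1G
  # the shell salloc returns runs outside the job's cgroups
  grep slurm /proc/self/cgroup    # expect no output here
  echo $CUDA_VISIBLE_DEVICES      # expect empty here

  # launch a shell as a job step inside the allocation
  srun --pty bash
  grep slurm /proc/self/cgroup    # expect .../slurm/uid_<uid>/job_<jobid>/step_<stepid>
  echo $CUDA_VISIBLE_DEVICES      # expect the gres plugin to have set this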
As a new SLURM admin, I am a bit surprised at this default behavior.
It seems like a way for users to game the system by never running srun.
I suppose the only limit that is really being enforced at that point
is walltime?
I guess I need to research srun and SallocDefaultCommand more, but is
there a way to set a separate, shorter walltime limit covering the
window between salloc granting the allocation and the user actually
running srun? It is also not clear to me whether a
SallocDefaultCommand that wraps "srun ..." can really cover all
possibilities.
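If it helps anyone else searching the archives, the sort of thing I am
now looking at is along these lines. This is only a sketch adapted
from the SallocDefaultCommand entry in the slurm.conf man page, and
the exact srun flags will need checking against your Slurm version:

  # slurm.conf: have salloc drop the user straight into an
  # srun-launched shell, so the interactive shell itself lands in the
  # job's cgroups
  SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"

My understanding is that -n1 -N1 and --mem-per-cpu=0 keep the wrapper
step from tying up the resources the user asked for, but I still need
to work out how this interacts with --gres on our GPU nodes.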