[slurm-users] cgroup limits not created for jobs
Paul Raines
raines at nmr.mgh.harvard.edu
Sun Jul 26 19:21:09 UTC 2020
On Sat, 25 Jul 2020 2:00am, Chris Samuel wrote:
> On Friday, 24 July 2020 9:48:35 AM PDT Paul Raines wrote:
>
>> But when I run a job, on the node it runs on I can find no
>> evidence in cgroups of any limits being set
>>
>> Example job:
>>
>> mlscgpu1[0]:~$ salloc -n1 -c3 -p batch --gres=gpu:quadro_rtx_6000:1 --mem=1G
>> salloc: Granted job allocation 17
>> mlscgpu1[0]:~$ echo $$
>> 137112
>> mlscgpu1[0]:~$
>
> You're not actually running inside a job at that point unless you've defined
> "SallocDefaultCommand" in your slurm.conf, and I'm guessing that's not the
> case there. You can make salloc fire up an srun for you in the allocation
> using that option; see the docs here:
>
> https://slurm.schedmd.com/slurm.conf.html#OPT_SallocDefaultCommand
>
Thank you so much. This also explains the missing CUDA_VISIBLE_DEVICES
problem from my previous post.
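If I understand this correctly, comparing the shell salloc hands back
with one started via srun inside the allocation should make the
difference visible. A rough sketch (the cgroup paths assume cgroup v1
and the task/cgroup plugin; adjust for your setup):

  salloc -n1 -c3 -p batch --gres=gpu:quadro_rtx_6000:1 --mem=1G
  # the shell salloc returns runs outside the job's cgroups
  grep slurm /proc/self/cgroup    # expect no output here
  echo $CUDA_VISIBLE_DEVICES      # expect empty here

  # launch a shell as a job step inside the allocation
  srun --pty bash
  grep slurm /proc/self/cgroup    # expect .../slurm/uid_<uid>/job_<jobid>/step_<stepid>
  echo $CUDA_VISIBLE_DEVICES      # expect the gres plugin to have set this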
As a new SLURM admin, I am a bit surprised at this default behavior.
It seems like a way for users to game the system by never running srun.
I suppose the only limit that is really being enforced at that point
is walltime?
I guess I need to research srun and SallocDefaultCommand more, but is
there a way to set a separate, shorter walltime limit covering the
window between salloc granting the allocation and the user actually
running srun? It is also not clear to me whether a
SallocDefaultCommand that wraps "srun ..." can really cover all
possibilities.
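If it helps anyone else searching the archives, the sort of thing I am
now looking at is along these lines. This is only a sketch adapted
from the SallocDefaultCommand entry in the slurm.conf man page, and
the exact srun flags will need checking against your Slurm version:

  # slurm.conf: have salloc drop the user straight into an
  # srun-launched shell, so the interactive shell itself lands in the
  # job's cgroups
  SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"

My understanding is that -n1 -N1 and --mem-per-cpu=0 keep the wrapper
step from tying up the resources the user asked for, but I still need
to work out how this interacts with --gres on our GPU nodes.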