[slurm-users] cgroup limits not created for jobs

Christopher Samuel chris at csamuel.org
Mon Jul 27 05:35:08 UTC 2020

On 7/26/20 12:21 pm, Paul Raines wrote:

> Thank you so much.  This also explains my GPU CUDA_VISIBLE_DEVICES missing
> problem in my previous post.

I've missed that, but yes, that would do it.

> As a new SLURM admin, I am a bit surprised at this default behavior.
> Seems like a way for users to game the system by never running srun.

This is because, by default, salloc only requests a job allocation; it
expects you to use srun to run an application on a compute node. But
yes, it is non-obvious (as evidenced by the number of "sinteractive" and
other scripts out there that folks have written without knowing about
the SallocDefaultCommand config option - I wrote one back in 2013!).

> The only limit I suppose that is being really enforced at that point
> is walltime?

Well, the user isn't on the compute node, so there's really nothing else
to enforce.

> I guess I need to research srun and SallocDefaultCommand more, but is 
> there some way to set some kind of separate walltime limit on a
> job for the time a salloc has to run srun?  It is not clear if one
> can make a SallocDefaultCommand that does "srun ..." that really
> covers all possibilities.

An srun inside a salloc (just like one inside an sbatch script) should
not be able to exceed the time limit for the job allocation.
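
As a rough sketch of that behaviour: salloc sets the limit for the
allocation, and the srun launched inside it runs on the compute node
under that same limit. The 30-minute limit, the node and task counts and
the run_in_allocation helper name are illustrative assumptions, not
anything from our site config.

```shell
#!/bin/sh
# Illustrative value only: a 30-minute limit for the whole allocation.
TIME_LIMIT="00:30:00"

run_in_allocation() {
    # salloc requests the allocation; the trailing srun launches the
    # given command on the allocated compute node, where it is bound
    # by the job's limits (including the time limit above).
    salloc -N1 -n1 --time="$TIME_LIMIT" srun "$@"
}

# Only attempt this where Slurm's client commands are installed.
if command -v salloc >/dev/null 2>&1; then
    run_in_allocation hostname
else
    echo "salloc not found; run this on a Slurm submit host" >&2
fi
```

Without the srun, the command given to salloc would run on the host
where salloc was invoked rather than on the compute node.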

If it helps, this is the SallocDefaultCommand we use for our GPU nodes:

srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 -G 0 --gpus-per-task=0 
--gpus-per-node=0 --gpus-per-socket=0  --pty --preserve-env --mpi=none 
-m block $SHELL

We have to spell out all of those permutations so that this srun does
not use any GPU GRES. Otherwise it will consume whatever GPUs the salloc
asked for, and when the user then tries to "srun" their application
across the nodes it will block, as there won't be any GPUs available on
this first node.
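
In slurm.conf the whole command goes on one line as a quoted string. A
sketch, assuming the command above is used unchanged (check the
slurm.conf man page for your Slurm version before copying it):

```text
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 -G 0 --gpus-per-task=0 --gpus-per-node=0 --gpus-per-socket=0 --pty --preserve-env --mpi=none -m block $SHELL"
```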

Of course, the fact that the user then can't see the GPUs without an
srun can confuse some people, but it's unavoidable for this use case.

All the best,
   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
