[slurm-users] [External] Re: What is an easy way to prevent users run programs on the master/login node.

Prentice Bisbal pbisbal at pppl.gov
Tue Apr 27 15:45:49 UTC 2021


This is not a good approach. There are plenty of jobs that will hog a 
system's resources without using MPI. MATLAB and Mathematica both 
support parallel computation and don't need MPI to do so. Then there 
are OpenMP and other threaded applications that don't need 
mpirun/mpiexec to launch them.

Limiting the number of processes or threads is not the only concern. You 
can easily run a single-threaded task that hogs all the RAM. Or a user 
may use bbcp to transfer a large amount of data, choking the network 
interface.

Using cgroups is really the only reliable way to limit users, and 
Arbiter seems like the best way to automatically manage cgroup-imposed 
limits.

I haven't used Arbiter myself yet, but I've seen presentations on it, 
and I'm preparing to deploy it.

https://dylngg.github.io/resources/arbiterTechPaper.pdf
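
For reference, the kind of per-user cgroup limit Arbiter manages can 
also be set by hand on a systemd-based login node. A minimal sketch 
(the UID and the values are only examples, not taken from the Arbiter 
paper):

  # Cap the user with UID 1000 at two CPUs' worth of time and 16 GB of RAM
  # (MemoryMax= needs cgroup v2; older cgroup-v1 hosts use MemoryLimit=)
  systemctl set-property user-1000.slice CPUQuota=200% MemoryMax=16G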

Prentice

On 4/25/21 3:46 AM, Patrick Begou wrote:
> Hi,
>
> I also saw a cluster setup where mpirun or mpiexec commands were
> replaced by a shell script just saying "please use srun or sbatch...".
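>
> Roughly along these lines (just an illustrative sketch, not the
> actual script from that cluster):
>
>     #!/bin/sh
>     # Fake mpirun/mpiexec placed ahead of the real one in PATH on the
>     # login node; it only prints a reminder and refuses to run.
>     echo "Please submit your job with srun or sbatch instead of" >&2
>     echo "running mpirun/mpiexec on the login node." >&2
>     exit 1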
>
> Patrick
>
> Le 24/04/2021 à 10:03, Ole Holm Nielsen a écrit :
>> On 24-04-2021 04:37, Cristóbal Navarro wrote:
>>> Hi Community,
>>> I have a set of users still not so familiar with slurm, and yesterday
>>> they bypassed srun/sbatch and just ran their CPU program directly on
>>> the head/login node thinking it would still run on the compute node.
>>> I am aware that I will need to teach them some basic usage, but in
>>> the meanwhile, how have you solved this type of user-behavior
>>> problem? Is there a preferred way to restrict the master/login
>>> resources, or actions, to regular users?
>> We set user limits in /etc/security/limits.conf so users can't
>> run very long or very big tasks on the login nodes:
>>
>> # Normal user limits (cpu is in minutes; rss, data, and stack are in KB)
>> *               hard    cpu             20
>> *               hard    rss             50000000
>> *               hard    data            50000000
>> *               soft    stack           40000000
>> *               hard    stack           50000000
>> *               hard    nproc           250
>>
>> /Ole
>>
>
