[slurm-users] How to deal with user running stuff in frontend node?

Nicholas McCollum nmccollum at asc.edu
Thu Feb 15 15:12:21 MST 2018


I had previously contacted Ryan Cox about his solution and worked with 
it a little to implement it on our CentOS 7 cluster.  While I liked his 
solution, I felt it was a little complex for our needs.

I'm a big fan of keeping stuff real simple, so I came up with two simple 
shell scripts to solve the issue.

I have ulimits set to 10 minutes of cpu time.
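
For reference, a limit like that can be expressed in 
/etc/security/limits.conf roughly as follows (the wildcard domain and 
the root exemption are my assumptions, not necessarily the exact 
config; the "cpu" item is in minutes):

# /etc/security/limits.conf -- cap interactive CPU time at 10 minutes
*       soft    cpu     10
*       hard    cpu     10
root    -       cpu     unlimited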

One script runs continuously and executes:
# systemctl set-property user-$userid.slice CPUQuota=200%
# systemctl set-property user-$userid.slice CPUShares=256
# systemctl set-property user-$userid.slice MemoryLimit=4294967296

... for each user that it discovers logged in.  This effectively caps 
the number of CPU cores and the amount of memory each user can use.
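
A minimal sketch of what that loop might look like (discovering 
sessions with "who" and the 60-second polling interval are my 
assumptions, not necessarily how the original script works):

#!/bin/bash
# Sketch: apply per-user slice limits to everyone currently logged in.
while true; do
    for user in $(who | awk '{print $1}' | sort -u); do
        uid=$(id -u "$user") || continue
        # User slices are named by numeric UID, e.g. user-1000.slice
        systemctl set-property "user-${uid}.slice" \
            CPUQuota=200% CPUShares=256 MemoryLimit=4294967296
    done
    sleep 60
done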

The other script runs every 5 minutes and looks through /proc to find 
any processes that I want to allow to exceed the ulimit.  I have an 
array of processes (like tar, bzip2, rsync, scp, etc.) that I don't 
mind running past 10 minutes of cputime.  This script looks for those 
processes and runs:

# prlimit --pid $PID --cpu=unlimited

That way ulimits don't apply to those applications.
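
A sketch of that whitelist pass (the original walks /proc directly; 
using pgrep and this particular process list are my simplifications):

#!/bin/bash
# Sketch: lift the cputime rlimit for whitelisted long-running tools.
# Intended to run from cron every 5 minutes.
WHITELIST=(tar bzip2 rsync scp)
for name in "${WHITELIST[@]}"; do
    for pid in $(pgrep -x "$name"); do
        prlimit --pid "$pid" --cpu=unlimited
    done
done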

It's actually worked so well that I had totally forgotten about it until 
I saw this thread.  If you'd like a copy of the shell scripts, just send 
me an e-mail.

---

Nicholas McCollum - HPC Systems Expert
Alabama Supercomputer Authority - CSRA

On 02/15/2018 03:05 PM, Ryan Cox wrote:
> Manuel,
> 
> We set up cgroups and also do cputime limits (60 minutes in our case) in 
> limits.conf.  Before libcgroup supported a more generic "apply to each 
> user" mechanism, I created a PAM module that handles all of that; it 
> still works well for creating per-user limits.  We also have
> something that whitelists various file transfer programs so they aren't 
> subject to cputime limits.  We include an oom notifier daemon so that 
> users are alerted when their cgroup runs out of memory since many people 
> would otherwise have a tough time figuring out the exact cause of the 
> "Killed" message. All of this is available in 
> https://github.com/BYUHPC/uft (see the "Recommended Configuration" 
> section in the README.md for "Login Nodes").
> 
> We've had this in place for years and pretty much don't even have to 
> think about this anymore.  No complaints either.
> 
> If I had a user abusing the system after a warning, I would probably 
> kick him off for a cooling-off period and/or implement a very strict 
> cputime limit (10 minutes?) in limits.conf just for him.  Just my 
> $0.02.
> 
> Ryan
> 
> On 02/15/2018 08:11 AM, Manuel Rodríguez Pascual wrote:
>> Hi all,
>>
>> Although this is not strictly related to Slurm, maybe you can 
>> recommend some actions for dealing with a particular user.
>>
>> On our small cluster there are currently no limits on running 
>> applications on the frontend.  This is sometimes really useful for 
>> some users, for example to have scripts that monitor the execution of 
>> jobs and make decisions based on partial results.
>>
>> However, we have a user who keeps abusing this: when the job queue 
>> is long and there is a significant wait time, he sometimes runs his 
>> jobs on the frontend, resulting in a CPU load of 100% and delays in 
>> the things the node is supposed to serve (user logins, monitoring and 
>> so on).
>>
>> Have you faced the same issue?  Is there any solution?  I am thinking 
>> about using ulimit to limit the execution time of these jobs on the 
>> frontend to 5 minutes or so.  This does not look very elegant, though, 
>> as other users could commit the same abuse in the future, and he 
>> should also be able to run low-CPU jobs for a longer period.  However, 
>> I am not an experienced sysadmin, so I am completely open to 
>> suggestions or different ways of approaching this issue.
>>
>> Any thoughts?
>>
>> cheers,
>>
>> Manuel
> 


