[slurm-users] How to deal with user running stuff in frontend node?

Petersen, Dirk petersen at fredhutch.org
Thu Feb 15 17:12:01 MST 2018


I think cgroups are probably the more elegant solution, but here is another script:

https://github.com/FredHutch/IT/blob/master/py/loadwatcher.py#L59

The email text is hard-coded, so please change it before using the script. We put this in place in October 2017, when things were getting out of control because people were running much more multithreaded software than before. Since then we have had 95 users removed from one of the login nodes and several hundred warnings sent. The command

    killall -9 -v -g -u username

has been very effective. We have 3 login nodes with 28 cores and almost 400G RAM.
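
For anyone who wants to see the general idea without reading the whole script, here is a minimal sketch of the approach. It is not the code from loadwatcher.py. It assumes a Linux host with procps `ps`, and the names cpu_by_user, users_over_limit, kill_command and the "root" exemption are made up for illustration. The 400 % limit is taken from the warning email quoted below. Note that the %CPU reported by ps is averaged over each process's lifetime, so a real watcher may prefer to sample over a short interval instead.

```python
import subprocess
from collections import defaultdict

# Limit taken from the warning email: 400 % = 4.0 CPU cores.
CPU_LIMIT_PERCENT = 400.0


def cpu_by_user(ps_output):
    """Sum the %CPU column per user from `ps -eo user:32,pcpu --no-headers` output."""
    totals = defaultdict(float)
    for line in ps_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        user, pcpu = parts
        try:
            totals[user] += float(pcpu)
        except ValueError:
            continue  # skip a header line or unparsable value
    return dict(totals)


def users_over_limit(totals, limit=CPU_LIMIT_PERCENT, exempt=("root",)):
    """Return the sorted user names whose summed %CPU exceeds the limit."""
    return sorted(u for u, pct in totals.items() if pct > limit and u not in exempt)


def kill_command(user):
    """Build the killall command from the post as an argument list."""
    return ["killall", "-9", "-v", "-g", "-u", user]


if __name__ == "__main__":
    out = subprocess.run(
        ["ps", "-eo", "user:32,pcpu", "--no-headers"],
        capture_output=True, text=True, check=True,
    ).stdout
    for user in users_over_limit(cpu_by_user(out)):
        # Dry run only: print the command instead of executing it.
        print("would run:", " ".join(kill_command(user)))
```

The sketch only prints the command it would run. Sending the warning email and actually calling killall are left out on purpose.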

Dirk


-----Original Message-----
From: hpcxxxxxx at lists.fhcrc.org [mailto:hpcxxxxxxxxxxx at lists.fhcrc.org] On Behalf Of loadwatchxxxxxxxxxxx at fhcrc.org
Sent: Tuesday, November 14, 2017 11:45 AM
To: Doe, John <xxxxxxxxx @fredhutch.org>
Subject: [hpcpol] RHINO3: Your jobs have been removed!



This is a notification message from loadwatcher.py, running on host RHINO3. Please review the following message:

jdoe, your CPU utilization on rhino3 is currently 4499 %!

For short term jobs you can use no more than 400 % or 4.0 CPU cores on the Rhino machines.
We have removed all your processes from this computer.
Please try again and submit batch jobs
or use the 'grabnode' command for interactive jobs.

see http://scicomp.fhcrc.org/Gizmo%20Cluster%20Quickstart.aspx
or http://scicomp.fhcrc.org/Grab%20Commands.aspx
or http://scicomp.fhcrc.org/SciComp%20Office%20Hours.aspx

If output is being captured, you may find additional information in your logs.





Dirk Petersen
Scientific Computing Director
Fred Hutch
1100 Fairview Ave. N.
Mail Stop M4-A882
Seattle, WA 98109
Phone: 206.667.5926
Skype: internetchen
