[slurm-users] How to deal with user running stuff in frontend node?

Michael Jennings mej at lanl.gov
Thu Feb 15 09:06:39 MST 2018


On Thursday, 15 February 2018, at 16:11:29 (+0100),
Manuel Rodríguez Pascual wrote:

> Although this is not strictly related to Slurm, perhaps you can recommend
> some ways to deal with a particular user.
> 
> On our small cluster, there are currently no limits on running applications
> on the frontend. This is sometimes really useful for some users, for example
> to have scripts that monitor the execution of jobs and make decisions
> based on the partial results.
> 
> However, we have one user who keeps abusing this setup: when the job
> queue is long and the wait is significant, he sometimes runs his
> jobs on the frontend, resulting in a CPU load of 100% and delays in
> using it for the things it is supposed to serve (user login, monitoring and
> so on).
> 
> Have you faced the same issue?  Is there any solution? I am thinking about
> using ulimit to limit the execution time of these jobs on the frontend to 5
> minutes or so. However, this does not seem very elegant, since other users
> could commit the same abuse in the future, and he should still be able to run
> low-CPU jobs for a longer period. I am not an experienced sysadmin, so I am
> completely open to suggestions or different ways of approaching this issue.
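
(On the ulimit idea: the usual way to apply that machine-wide is a pam_limits
entry rather than a per-shell ulimit.  A minimal sketch, with placeholder
values, is below; note that the "cpu" item counts CPU minutes rather than
wall-clock time, so a low-CPU monitoring script can still run for hours, and
wildcard entries are never applied to root.)

    # /etc/security/limits.conf (read by pam_limits at login)
    # Hard cap of 5 CPU-minutes per process for everyone; wildcard
    # entries do not apply to the root user.
    *    hard    cpu    5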

I don't do this at my current job, but at my previous one, I used NHC
(https://github.com/mej/nhc) with a special config context I called
"patrol."  I ran "nhc-patrol" (symlinked to /usr/sbin/nhc) with the
following /etc/nhc/nhc-patrol.conf:

### Kill ANY user processes consuming 98% or more of a CPU core or 20+% of RAM
   ln* || check_ps_cpu -a -l -s -u '!root' -m '!/(^|\/)((mpi|i|g)?(cc|CC|fortran|f90))$/' 98%
   ln* || check_ps_physmem -a -l -s -u '!root' -m '!/(^|\/)((mpi|i|g)?(cc|CC|fortran|f90))$/' 20%

### Ban certain processes on login nodes, like OpenMPI's "orted"
### or various file transfer tools which belong on the DTN.
   ln* || check_ps_mem -a -k -u '!root' -m '/(^|\/)(scp|sftp-server|bbcp|ftp|lftp|ncftp|sftp|unison|rsync)$/' 1k
   ln* || check_ps_mem -a -k -u '!root' -m '/(^|\/)(orted|mpirun|mpiexec|MATLAB)$/' 1k

### Ban certain misbehaving Python scripts and known application binaries
   ln* || check_ps_mem -a -l -s -u '!root' -m '/(\.cplx\.x|xas\.x|volume.py|Calculate|TOUGH_|mcnpx|main|eco2n)/' 1G
   ln* || check_ps_mem -a -l -s -u '!root' -f -m '/(^|\/)(([^b][^a][^s][^h].*|)Calculate(Ratio|2PartCorr)|.*main config\.lua|.*python essai\.py|.*python \.?\/?input_hj\.py|.*python .*/volume\.py|java -jar.*(SWORD|MyProxyLogon).*)/' 1G
   ln* || check_ps_time -a -l -s -k -u '!root' -m '/(^|\/)(projwfc.x|xas_para.x|pp.x|pw.x|new_pw.x|xi0.cplx.x|sigma.cplx.x|xcton.cplx.x|diag.cplx.x|sapo.cplx.x|plotxct.cplx.x|forces.cplx.x|absorption.cplx.x|shirley_xas.x|wannier90.x|pw2bgw.x|metal_qp_single.x|epsbinasc.cplx.x|inteqp.cplx.x|summarize_eigenvectors.cplx.x|epsomega.cplx.x|epsilon.cplx.x|volume.py|ccsm.exe|pgdbg|pgserv|paratec.mpi|abinip|puppet|pyMPI|real.exe|denchar|nearBragg_mpi_test_8|vasp|cam|qcprog.exe|viewer|elk|bertini|namd2|Calculate2PartCorr|TOUGH_Shale|t2eos3_mp|pho|lmp_mftheory|t101|mcnpx.ngsi|chuk_code_mpi.exe|cape_calc|test_fortran_c_mixer|g_wham_d_mpi|tr2.087_eco2n_lnx|scph|phon|TRGw|tt2)$/' 1s

### Ban certain programs from specific naughty users
   ln* || check_ps_physmem -a -l -s -k -u baduser1 -m '*test' 1k
   ln* || check_ps_time -a -l -s -k -u baduser2 -m R 10s

### Prohibit non-file-transfer tools on the DTNs
 xfer* || check_ps_time -a -l -s -k -u '!root' -f -m '/ssh ln/' 1s
 xfer* || check_ps_time -a -l -s -u '!root' -f -m '!/\b(sshd: |ssh-agent|rsync|scp|ftp|htar|hsi|-?t?csh|-?bash|portmap|rpc|dbus-daemon|ntpd|xfs)\b/' 1s
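
For reference, NHC picks its config file based on the name it was invoked as,
which is why the symlink gives you a separate "patrol" context without
touching the normal health checks.  The wiring might look roughly like this
(a sketch; the cron interval is just an example):

    # NHC reads /etc/nhc/<invocation-name>.conf, so an alternate name
    # selects the nhc-patrol.conf shown above:
    ln -s /usr/sbin/nhc /usr/sbin/nhc-patrol

    # Sweep the login nodes periodically, e.g. from /etc/cron.d/nhc-patrol:
    #   */5 * * * *  root  /usr/sbin/nhc-patrol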

---------------------------------------

Obviously this is something you'd need to tweak for your own needs, and
some of these things are now better done via cgroups.  But not only did
it help me keep the login nodes clean; once upon a time, it also helped
me catch a would-be hacker!  :-)
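
For the cgroup route on a systemd-based login node, one common approach
(a sketch with placeholder numbers; it assumes a reasonably recent systemd
and cgroup v2) is a per-user slice drop-in:

    # /etc/systemd/system/user-.slice.d/50-login-limits.conf
    # Applies to every user-UID.slice; on older systemd you may need
    # per-UID drop-ins instead.  MemoryMax= assumes cgroup v2 (use
    # MemoryLimit= on v1).
    [Slice]
    CPUQuota=200%
    MemoryMax=8G
    TasksMax=512

    # Activate with: systemctl daemon-reload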

HTH,
Michael

-- 
Michael E. Jennings <mej at lanl.gov>
HPC Systems Team, Los Alamos National Laboratory
Bldg. 03-2327, Rm. 2341     W: +1 (505) 606-0605


