[slurm-users] Areas for improvement on our site's cluster scheduling

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Tue May 8 05:36:51 MDT 2018


On 05/08/2018 09:49 AM, John Hearns wrote:
> Actually what IS bad is users not putting cluster resources to good use.
> You can often see jobs which are 'stalled', i.e. the nodes are reserved
> for the job, but the internal logic of the job has failed and the
> executables have not launched. Or maybe some user is running an
> interactive job and has wandered off for coffee/beer/an extended holiday.
> It is well worth scanning for stalled jobs and terminating them.

I agree. I monitor our cluster for jobs that do little or no useful work
with my small utility "pestat", available from
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat

I run "pestat -F" many times every day to spot inefficient jobs.
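
For readers who want to see the basic idea without installing anything, 
here is a minimal sketch of a low-load check. It is not how pestat is 
implemented; it only assumes per-node output from 
sinfo -N -h -o "%N %T %O %c" (node name, state, CPU load, CPU count). The 
function names and the 0.5 threshold are made up for illustration, and 
only fully allocated nodes are checked, since partly allocated ("mixed") 
nodes would need per-job CPU counts to judge fairly.

```python
import subprocess


def find_underloaded(sinfo_lines, min_fraction=0.5):
    """Return (node, load, cpus) for allocated nodes with suspiciously low load.

    Each input line is expected to hold four whitespace-separated fields:
    node name, node state, CPU load, CPU count.
    min_fraction is an arbitrary illustrative threshold, not a Slurm setting.
    """
    flagged = []
    seen = set()  # sinfo -N lists a node once per partition it belongs to
    for line in sinfo_lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        node, state, load, cpus = fields
        if node in seen or not state.startswith("allocated"):
            continue
        try:
            load_val = float(load)
            cpu_count = int(cpus)
        except ValueError:
            continue  # load can be reported as "N/A"
        seen.add(node)
        if load_val < min_fraction * cpu_count:
            flagged.append((node, load_val, cpu_count))
    return flagged


def read_sinfo():
    """Run sinfo and return its output lines (requires Slurm on this host)."""
    cmd = ["sinfo", "-N", "-h", "-o", "%N %T %O %c"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()
```

On a Slurm login node, find_underloaded(read_sinfo()) would give a list 
of nodes worth a closer look. A low load alone does not prove a job is 
stalled, so the processes on the node should still be inspected before 
anyone is notified or a job is cancelled.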

If I want to list the user processes belonging to a job, I use "psjob 
<jobid>".  I notify users and possibly cancel their jobs using the 
"notifybadjob <jobid>" script.  These tools are available at 
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs

/Ole


