[slurm-users] Areas for improvement on our site's cluster scheduling

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Tue May 8 05:36:51 MDT 2018

On 05/08/2018 09:49 AM, John Hearns wrote:
> Actually what IS bad is users not putting cluster resources to good use.
> You can often see jobs which are 'stalled', i.e. the nodes are reserved
> for the job, but the internal logic of the job has failed and the
> executables have not launched. Or maybe some user is running an
> interactive job and has wandered off for coffee/beer/an extended holiday.
> It is well worth scanning for stalled jobs and terminating them.

I agree, and the way I monitor our cluster for jobs that do little or no 
useful work is through my small utility "pestat" available from 

I run "pestat -F" many times every day to spot inefficient jobs.
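
For readers who want to see the general idea, here is a minimal sketch of one way to flag nodes whose CPU load is far below the number of allocated CPUs. It is not how pestat is implemented. It assumes the input is text of the form "node allocated/idle/other/total load" per line, as I would expect from something like sinfo -N -h -o "%N %C %O". The function name find_underused_nodes and the 0.5 threshold are hypothetical choices made for illustration only.

```python
def find_underused_nodes(sinfo_text, min_ratio=0.5):
    """Return (node, allocated_cpus, load) for nodes whose CPU load is
    below min_ratio times the number of allocated CPUs.

    Assumed input: one line per node, "name alloc/idle/other/total load".
    """
    flagged = []
    for line in sinfo_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        node, cpu_states, load_text = fields
        try:
            allocated = int(cpu_states.split("/")[0])
            load = float(load_text)
        except ValueError:
            continue  # e.g. the load is reported as "N/A"
        # Idle nodes (no allocated CPUs) are not interesting here.
        if allocated > 0 and load < min_ratio * allocated:
            flagged.append((node, allocated, load))
    return flagged


# Hypothetical sample data, not output from a real cluster.
sample = (
    "node001 16/0/0/16 15.90\n"
    "node002 16/0/0/16 0.03\n"
    "node003 0/16/0/16 0.01\n"
    "node004 8/8/0/16 N/A\n"
)

for node, allocated, load in find_underused_nodes(sample):
    print(f"{node}: {allocated} CPUs allocated, load {load}")
```

A low load is only a hint that a job may be stalled, so any node flagged this way still needs a look at the job's processes before contacting the user.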

If I want to list the user processes belonging to a job, I use "psjob 
<jobid>".  I notify users and possibly cancel their jobs using the 
"notifybadjob <jobid>" script.  These tools are available at 

