[slurm-users] Areas for improvement on our site's cluster scheduling
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Tue May 8 05:36:51 MDT 2018
On 05/08/2018 09:49 AM, John Hearns wrote:
> Actually what IS bad is users not putting cluster resources to good use.
> You can often see jobs which are 'stalled' - i.e. the nodes are reserved
> for the job, but the internal logic of the job has failed and the
> executables have not launched. Or maybe some user is running an
> interactive job and has wandered off for coffee/beer/an extended holiday.
> It is well worth scanning for stalled jobs and terminating them.
I agree. I monitor our cluster for jobs that do little or no useful
work with my small utility "pestat", available from
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat
I run "pestat -F" many times every day to spot inefficient jobs.
To list the user processes belonging to a job, I use "psjob <jobid>".
I notify users, and possibly cancel their jobs, using the
"notifybadjob <jobid>" script. These tools are available at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs
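For readers who want to see the general idea of such a scan, here is a minimal
Python sketch. It is not how pestat is implemented. It only illustrates one
possible heuristic: compare each node's reported CPU load with its number of
allocated CPUs and list nodes whose load is far lower. The function names, the
0.5 threshold, and the choice of the sinfo output format "%N %C %O" (node name,
CPUs as allocated/idle/other/total, CPU load) are assumptions made for this
example.

```python
import subprocess


def parse_sinfo(text):
    """Parse lines like 'a001 16/0/0/16 15.80' into {node: (alloc_cpus, load)}."""
    nodes = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        name, cpus, load = parts
        try:
            alloc = int(cpus.split("/")[0])  # first field of A/I/O/T
            nodes[name] = (alloc, float(load))
        except ValueError:
            continue  # e.g. load reported as N/A for an unreachable node
    return nodes


def underused(nodes, min_ratio=0.5):
    """Return nodes with allocated CPUs whose load is below min_ratio * allocated.

    min_ratio is an illustrative threshold, not a recommended value.
    """
    return sorted(
        name
        for name, (alloc, load) in nodes.items()
        if alloc > 0 and load < min_ratio * alloc
    )


if __name__ == "__main__":
    # -N: one line per node, -h: no header line
    out = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %C %O"],
        capture_output=True, text=True, check=True,
    ).stdout
    for node in underused(parse_sinfo(out)):
        print(node)
```

A node listed by this sketch is only a candidate for closer inspection. A low
load can be legitimate (for example a job that is waiting on I/O or is
deliberately using few cores), so the processes of the job should be checked
before notifying the user or cancelling anything.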
/Ole