[slurm-users] Hints, Cheatsheets, etc
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Tue Jul 9 06:16:57 UTC 2019
Hi Edward,
Besides my Slurm Wiki page https://wiki.fysik.dtu.dk/niflheim/SLURM, I
have written a number of tools that we use for monitoring our cluster;
see https://github.com/OleHolmNielsen/Slurm_tools. In particular, I
recommend these tools:
* pestat  Prints the status of the Slurm cluster's nodes, one line per
  node, together with job info.
* showuserjobs  Prints the current node status and batch job status,
  broken down by user ID.
  Use the option "-p <partition>" to display partition data.
I also recommend this nice tool for displaying partition statistics:
* spart  A user-oriented partition info command for Slurm.
https://github.com/mercanca/spart
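For the two questions quoted below (CPUs in use in a partition, and why
jobs are waiting), a minimal Python sketch is shown here. It is not part
of any of the tools above. It assumes that sinfo's "%C" format field
prints CPU counts as allocated/idle/other/total and that squeue's "%r"
field prints the pending reason, as described in the Slurm man pages.
The function names are hypothetical, chosen only for illustration.

```python
import subprocess
from collections import Counter


def parse_cpu_states(sinfo_output):
    """Sum the lines printed by `sinfo -h -o %C`.

    Each line is assumed to look like allocated/idle/other/total,
    for example "12/4/0/16".
    """
    totals = {"allocated": 0, "idle": 0, "other": 0, "total": 0}
    for line in sinfo_output.splitlines():
        line = line.strip()
        if not line:
            continue
        alloc, idle, other, total = (int(n) for n in line.split("/"))
        totals["allocated"] += alloc
        totals["idle"] += idle
        totals["other"] += other
        totals["total"] += total
    return totals


def count_pending_reasons(squeue_output):
    """Count how often each reason appears in `squeue -h -o %r` output."""
    reasons = (line.strip() for line in squeue_output.splitlines())
    return Counter(r for r in reasons if r)


def cpus_in_partition(partition):
    """Run sinfo for one partition and return its CPU counts by state."""
    result = subprocess.run(
        ["sinfo", "-h", "-p", partition, "-o", "%C"],
        capture_output=True, text=True, check=True,
    )
    return parse_cpu_states(result.stdout)


def pending_reasons(partition):
    """Run squeue for one partition and tally reasons of pending jobs."""
    result = subprocess.run(
        ["squeue", "-h", "-p", partition, "-t", "PENDING", "-o", "%r"],
        capture_output=True, text=True, check=True,
    )
    return count_pending_reasons(result.stdout)
```

On a machine with the Slurm client commands installed, calling
cpus_in_partition("foobar") and pending_reasons("foobar") would give a
rough answer to both questions; the two parse functions can also be
tried on saved command output without access to a cluster.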
/Ole
On 7/8/19 9:33 PM, Edward Ned Harvey (slurm) wrote:
> I am an experienced sysadmin, new to being a slurm admin, and I'm
> encountering some difficulty:
>
> If you have a simple question such as "how many cpu's are currently
> being used in the foobar partition," or "give me an overview of the
> waiting jobs and what are the reasons they're waiting" I don't have any
> good easy ways yet to answer these questions. I can get the total number
> of cpu's in a partition via "scontrol show partition foobar" and I can
> get how many cpus are being used on a particular node via "scontrol show
> node somenode" and I can get a (not easily parsable) list of nodes
> within a partition via "sinfo". So all the information is available, but
> very difficult to access because it would require some very nontrivial
> parsing.
>
> I see projects like this: https://github.com/fasrc/slurm_showq and
> https://github.com/fasrc/scalc which seem to be created exactly for this
> purpose. They're trying to make information in slurm more easily accessible.
>
> So, is there a better way to manage a slurm cluster, are there better
> tools, or better ways to use them? Any other suggestions for me from
> experienced slurm admins? Like, a cheatsheet of common commands or
> scripts like slurm_showq and scalc? Or is this just the normal state of
> the world?