[slurm-users] Tool for profiling resource usage by slurm jobs
Nicolas Granger
nicolas.granger at cea.fr
Mon Jan 8 10:30:39 UTC 2024
Hi all,
Happy new year everyone!
I've been looking for a simple tool that reports how much of the requested
resources a job actually consumes, to help my colleagues and me adjust job
requirements. I could not find such a tool, and the ones mentioned on
this ML were not easy to install and use, so I have written a new one:
https://github.com/CEA-LIST/sprofile
It's a simple python script which parses cgroup and nvml data from the
nvidia driver. It reports duration, cpu load, peak RAM, GPU load and
peak GPU memory like so:
-- sprofile report (node03) --
Time:         0:00:03 / 1:00:00
CPU load:     2.0 / 4.0
RAM peak mem: 7G / 8G
GPU load:     0.2 / 2.0
GPU peak mem: 7G / 40G
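For anyone curious about what the cgroup side involves, here is a hypothetical
sketch (not sprofile's actual code) of deriving CPU time and peak RAM from
cgroup v2 counters. The file contents are hard-coded samples in the shape of
cpu.stat and memory.peak; the real files live under the job's cgroup directory,
whose exact path depends on the site's Slurm/cgroup configuration:

```python
# Hypothetical sketch: parsing cgroup v2 counters the way a profiler might.
# On a real node these strings would be read from the job's cgroup directory,
# e.g. somewhere under /sys/fs/cgroup/ (exact path is site-dependent).

def parse_cpu_stat(text):
    """Parse cgroup v2 cpu.stat content into a dict of counters (microseconds)."""
    stats = {}
    for line in text.splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

def format_bytes(n):
    """Render a byte count with a binary-unit suffix, e.g. 7516192768 -> '7.0G'."""
    for unit in ("B", "K", "M", "G", "T"):
        if n < 1024:
            return f"{n:.1f}{unit}"
        n /= 1024
    return f"{n:.1f}P"

# Sample contents, as they might appear in cpu.stat and memory.peak:
cpu_stat = "usage_usec 7200000000\nuser_usec 7000000000\nsystem_usec 200000000"
memory_peak = "7516192768\n"

usage_sec = parse_cpu_stat(cpu_stat)["usage_usec"] / 1e6
print(f"CPU time: {usage_sec:.0f}s, RAM peak: {format_bytes(int(memory_peak))}")
```

Dividing usage_usec by the job's wall-clock time would give the average CPU
load figure shown in the report above.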
The requirements are that Slurm runs jobs under the cgroup plugin and that
accounting mode is enabled on the GPUs (nvidia-smi --accounting-mode=1).
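To make the second requirement concrete, the accounting state can be set and
inspected with nvidia-smi (the flags below are standard nvidia-smi options;
enabling accounting needs root and does not persist across driver reloads):

```shell
# Enable per-process GPU accounting (root required):
sudo nvidia-smi --accounting-mode=1

# Later, query the accounted processes and their peak GPU memory:
nvidia-smi --query-accounted-apps=pid,gpu_utilization,max_memory_usage --format=csv
```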
I hope you find this useful. Let me know if you find bugs or want to
contribute.
Regards,
Nicolas Granger