Hi all,
Happy new year everyone!
I've been looking for a simple tool that reports how many resources a job actually consumes, to help my colleagues and me adjust job requirements. I could not find one, or the ones mentioned on this mailing list were not easy to install and use, so I have written a new one: https://github.com/CEA-LIST/sprofile
It's a simple Python script that parses cgroup data and NVML data from the NVIDIA driver. It reports duration, CPU load, peak RAM, GPU load and peak GPU memory, like so:
|-- sprofile report (node03) -- Time: 0:00:03 / 1:00:00 CPU load: 2.0 / 4.0 RAM peak mem: 7G / 8G GPU load: 0.2 / 2.0 GPU peak mem: 7G / 40G|
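To illustrate how the "used / requested" pairs in such a report can guide adjustments, here is a minimal Python sketch that turns a report line into utilization ratios. It is not part of sprofile: the field names are copied from the example above, while the regular expression, the helper names (`to_number`, `utilization`) and the reading of K/M/G/T suffixes as powers of 1024 are assumptions made for illustration. It only handles H:MM:SS durations as shown in the example.

```python
import re

# Field names taken from the example report; the pattern itself is an assumption.
FIELD = re.compile(
    r"(Time|CPU load|RAM peak mem|GPU load|GPU peak mem):"
    r"\s*([\w.:]+)\s*/\s*([\w.:]+)"
)

# Assumed meaning of size suffixes (powers of 1024).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_number(value):
    """Convert '0:00:03', '7G' or '2.0' into a plain number."""
    if ":" in value:
        hours, minutes, seconds = (int(part) for part in value.split(":"))
        return hours * 3600 + minutes * 60 + seconds
    if value[-1] in UNITS:
        return float(value[:-1]) * UNITS[value[-1]]
    return float(value)


def utilization(report):
    """Map each field name to used / requested, as a fraction."""
    return {
        name: to_number(used) / to_number(requested)
        for name, used, requested in FIELD.findall(report)
    }


line = (
    "|-- sprofile report (node03) -- Time: 0:00:03 / 1:00:00 "
    "CPU load: 2.0 / 4.0 RAM peak mem: 7G / 8G "
    "GPU load: 0.2 / 2.0 GPU peak mem: 7G / 40G|"
)
for name, ratio in utilization(line).items():
    print(f"{name}: {ratio:.1%} of the request used")
```

In this example the job used about 10% of the requested GPU load and under 20% of the requested GPU memory, which suggests the GPU request could be lowered, whereas the RAM request is close to what was needed.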
The requirements are to use the Slurm cgroup plugin and to enable accounting mode on the GPUs (nvidia-smi --accounting-mode=1).
I hope you find this useful. Let me know if you find bugs or want to contribute.
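To check the GPU requirement on a node, the accounting mode can be queried through nvidia-smi. The following is a minimal sketch and not part of sprofile: the function names are hypothetical, and it assumes that `nvidia-smi --query-gpu=accounting.mode --format=csv,noheader` prints one "Enabled" or "Disabled" line per GPU.

```python
import subprocess
from typing import List


def parse_accounting_modes(output: str) -> List[bool]:
    """Turn one 'Enabled'/'Disabled' line per GPU into booleans."""
    return [
        line.strip().lower() == "enabled"
        for line in output.splitlines()
        if line.strip()
    ]


def query_accounting_modes() -> List[bool]:
    """Ask nvidia-smi for the accounting mode of every GPU on this node."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=accounting.mode", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,  # raise if nvidia-smi exits with an error
    )
    return parse_accounting_modes(result.stdout)
```

On a GPU node, `all(query_accounting_modes())` would then tell whether accounting is enabled on every GPU.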
Regards, Nicolas Granger