[slurm-users] Tool for profiling resource usage by slurm jobs

Nicolas Granger nicolas.granger at cea.fr
Mon Jan 8 10:30:39 UTC 2024


Hi all,

Happy new year everyone!

I've been looking for a simple tool that reports how many resources a 
job actually consumes, to help my colleagues and me adjust job 
requirements. I could not find such a tool, and the ones mentioned on 
this ML were not easy to install and use, so I have written a new one: 
https://github.com/CEA-LIST/sprofile

It's a simple Python script which parses cgroup data and NVML data from 
the NVIDIA driver. It reports duration, CPU load, peak RAM, GPU load and 
peak GPU memory like so:

-- sprofile report (node03) -- Time: 0:00:03 / 1:00:00 CPU load: 2.0 / 4.0 RAM peak mem: 7G / 8G GPU load: 0.2 / 2.0 GPU peak mem: 7G / 40G
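
To give an idea of what reading cgroup data can look like, here is a 
minimal sketch. It is not taken from sprofile: it assumes cgroup v2, 
where cpu.stat contains a usage_usec line, and the function names 
parse_cpu_stat and cpu_load are made up for illustration. The average 
CPU load is taken as CPU time consumed divided by wall-clock time.

```python
def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:  # skip blank or malformed lines
            stats[key] = int(value)
    return stats


def cpu_load(cpu_stat_text: str, elapsed_seconds: float) -> float:
    """Average number of busy cores: CPU seconds used / wall-clock seconds."""
    usage_usec = parse_cpu_stat(cpu_stat_text)["usage_usec"]
    return usage_usec / 1e6 / elapsed_seconds


# Hypothetical cpu.stat content; on a real node it would be read from the
# job's cgroup directory, whose path depends on the site's Slurm setup.
sample = "usage_usec 6000000\nuser_usec 5000000\nsystem_usec 1000000\n"
print(f"CPU load: {cpu_load(sample, 3.0):.1f}")
```

With 6 s of CPU time over 3 s of wall-clock time, this gives a load of 
2.0, i.e. two cores busy on average. Peak RAM could be read in a similar 
way from the memory.peak file on kernels that provide it.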

It requires the Slurm cgroup plugin to be in use and accounting to be 
enabled on the GPU (nvidia-smi --accounting-mode=1).

I hope you find this useful. Let me know if you find bugs or want to 
contribute.

Regards,
Nicolas Granger