[slurm-users] Automatically cancel jobs not utilizing their GPUs

Stephan Roth stephan.roth at ee.ethz.ch
Thu Jul 2 06:56:44 UTC 2020


Hi all,

Does anyone have ideas or suggestions on how to automatically cancel 
jobs which don't utilize the GPUs allocated to them?

The Slurm version in use is 19.05.

I'm thinking about collecting GPU utilization per process on all nodes 
with NVML/nvidia-smi, updating a running mean of the collected 
utilization per GPU, and cancelling a job if its mean stays below a 
to-be-defined threshold for a to-be-defined number of minutes.
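Roughly, the polling side could look like the sketch below. This is 
only a rough illustration, not a finished tool: the THRESHOLD, 
GRACE_MINUTES and INTERVAL values are placeholders, and the 
find_job_for_gpu() helper is a stub, since mapping a GPU index to the 
Slurm job occupying it depends on how GRES/cgroups are configured on 
the node (e.g. matching PIDs from nvidia-smi --query-compute-apps 
against "scontrol listpids").

  #!/usr/bin/env python3
  # Sketch: sample per-GPU utilization via nvidia-smi, keep an
  # incremental mean per GPU, and scancel the owning job once the
  # mean has stayed below a threshold for the grace period.
  import subprocess
  import time

  THRESHOLD = 5        # percent utilization below which a GPU is "idle"
  GRACE_MINUTES = 30   # how long the mean may stay below THRESHOLD
  INTERVAL = 60        # sampling interval in seconds

  def sample_utilization():
      """Return {gpu_index: utilization_percent} from nvidia-smi."""
      out = subprocess.check_output(
          ["nvidia-smi",
           "--query-gpu=index,utilization.gpu",
           "--format=csv,noheader,nounits"],
          text=True)
      samples = {}
      for line in out.strip().splitlines():
          idx, util = line.split(",")
          samples[int(idx)] = float(util)
      return samples

  def find_job_for_gpu(gpu_index):
      """Stub: map a GPU index to the Slurm job ID using it.
      Site-specific; left unimplemented here."""
      return None

  def main():
      means = {}  # gpu_index -> (running mean, sample count)
      while True:
          for idx, util in sample_utilization().items():
              mean, n = means.get(idx, (0.0, 0))
              n += 1
              mean += (util - mean) / n  # incremental mean update
              means[idx] = (mean, n)
              enough = n * INTERVAL >= GRACE_MINUTES * 60
              if enough and mean < THRESHOLD:
                  jobid = find_job_for_gpu(idx)
                  if jobid is not None:
                      subprocess.run(["scancel", str(jobid)])
                      means.pop(idx)  # reset stats after cancelling
          time.sleep(INTERVAL)

  if __name__ == "__main__":
      main()

One open question with this approach is resetting the statistics when 
a GPU changes hands between jobs, so a new job doesn't inherit the 
idle history of the previous one.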

Thank you for any input,

Cheers,
Stephan


