[slurm-users] GPU job still running after SLURM job is killed
Matt McKinnon
matt at techsquare.com
Wed Nov 22 06:55:51 MST 2017
Hi All,
I'm wondering if you've seen this issue before; I can't seem to find
anything on it.
We have an NVIDIA DGX-1 that we run SLURM on in order to queue up jobs
on its GPUs, but we're running into an issue:
1) launch a SLURM job (assume job id = 12345)
2) start a program that runs on GPUs and writes continuously to
disk (e.g., to ~/test.txt)
3) cancel the job from another terminal with scancel 12345
You would see that although job 12345 has been cancelled, the file
~/test.txt is still being written to, and the GPU memory taken up
by job 12345 has not been released. (A rough sketch of these steps
follows below.)
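For concreteness, a minimal sketch of those steps, assuming a
placeholder program gpu_burn.py standing in for whatever actually
drives the GPUs and writes to disk:

    #!/bin/bash
    # test_job.sbatch -- minimal repro script (gpu_burn.py is a
    # placeholder for any long-running GPU program that writes
    # continuously to ~/test.txt)
    #SBATCH --job-name=gpu-test
    #SBATCH --gres=gpu:1
    python gpu_burn.py > ~/test.txt

    # from a shell:
    sbatch test_job.sbatch    # -> "Submitted batch job 12345"
    scancel 12345             # cancel the job from another terminal
    tail -f ~/test.txt        # the file keeps growing after the cancel
    nvidia-smi                # GPU memory from the job is still allocated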
Have you seen anything like this? I'm trying to figure out whether it's
a SLURM issue or a GPU issue. We're running SLURM 17.02.7.
-Matt