[slurm-users] GPU job still running after SLURM job is killed

Matt McKinnon matt at techsquare.com
Wed Nov 22 06:55:51 MST 2017


Hi All,

I'm wondering if anyone has seen this issue; I can't seem to find 
anything on it:

We have an NVIDIA DGX-1 that we run SLURM on to queue up jobs on its 
GPUs, but we're running into an issue:

     1) Launch a SLURM job (assume the job ID is 12345).

     2) Start a program that runs on the GPUs and writes continuously to 
disk (e.g., to ~/test.txt).

     3) Kill the job from another terminal with scancel 12345.

You would see that although job 12345 has been killed, ~/test.txt is 
still being written to, and the GPU memory taken up by the job is still 
not released.
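
For concreteness, a minimal reproduction looks something like this (the 
batch script name repro.sh and the program gpu_writer are placeholders 
for whatever you actually run, not our exact setup):

     #!/bin/bash
     #SBATCH --job-name=gpu-leak-test
     #SBATCH --gres=gpu:1
     # gpu_writer stands in for any CUDA program that allocates GPU
     # memory and appends output to ~/test.txt in a loop
     ./gpu_writer >> ~/test.txt

Submit it with sbatch repro.sh, note the job ID, and then from another 
terminal:

     $ scancel 12345
     $ tail -f ~/test.txt   # the file keeps growing
     $ nvidia-smi           # GPU memory from the job is still allocated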

Has anyone seen anything like this?  We're trying to figure out whether 
it's a SLURM issue or a GPU issue.  We're running SLURM 17.02.7.

-Matt


