[slurm-users] GPU job still running after SLURM job is killed
hearnsj at gmail.com
Wed Nov 22 07:13:44 MST 2017
I saw a similar situation with a PBS job recently.
A process that is writing to disk cannot be killed (it is in D state,
uninterruptible sleep). So the job ended, but PBS logged that it could not
kill the process.
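For example, you can spot such stuck processes with standard ps options
(the awk filter just keeps rows whose state field contains D):

    # list processes in uninterruptible sleep (state D); these ignore
    # SIGKILL until the blocking I/O completes
    ps -eo pid,stat,wchan:20,cmd | awk '$2 ~ /D/'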
I would look in detail at the slurm logs at the point where that job is
being killed, and you might get some information.
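Something like the following is where I would start. The log path varies by
site, so /var/log/slurm/slurmd.log here is only an assumption; check your
slurm.conf:

    # find where this node logs slurmd messages
    scontrol show config | grep -i logfile
    # then look for the job id around the time of the kill
    grep 12345 /var/log/slurm/slurmd.log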
I guess this depends on the method which Slurm uses to kill a job.
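As far as I know, on termination Slurm sends SIGTERM, waits KillWait
seconds, then sends SIGKILL; a process that detaches from the job's process
group can escape that unless process tracking is cgroup-based. A sketch of
the relevant settings (the option names are real Slurm options, the values
are only illustrative):

    # slurm.conf
    ProctrackType=proctrack/cgroup  # track all of a job's processes via
                                    # cgroups, so daemonized children are
                                    # killed along with the job
    TaskPlugin=task/cgroup
    KillWait=30                     # seconds between SIGTERM and SIGKILL

    # cgroup.conf
    ConstrainDevices=yes            # restrict the job to the GPUs it was
                                    # actually granted via GRES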
Of course this could be a completely different scenario.
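Either way, it may be worth confirming whether an orphaned process is what
is holding the GPU memory. nvidia-smi can list the PIDs still attached
(these are standard nvidia-smi query options):

    # show compute processes still holding GPU memory
    nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
    # then check whether that PID is still parented to a Slurm step
    ps -o pid,ppid,stat,cmd -p <PID>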
On 22 November 2017 at 14:55, Matt McKinnon <matt at techsquare.com> wrote:
> Hi All,
> I'm wondering if you've seen this issue before; I can't seem to find
> anything on it:
> We have an NVIDIA DGX-1 that we run SLURM on to queue up jobs on its
> GPUs, but we're running into an issue:
> 1) launch a SLURM job (assume job id = 12345)
> 2) start a program that runs on GPUs and writes continuously to disk
> (e.g., to ~/test.txt)
> 3) kill the job from another terminal with the command: scancel 12345
> You would see that although job 12345 has been killed, the file
> ~/test.txt is still being written to, and the GPU memory taken up by
> the job is still not released.
> Have you seen anything like this? We're trying to figure out whether
> it's a SLURM issue or a GPU issue. We're running SLURM 17.02.7.