[slurm-users] GPU job still running after SLURM job is killed

Wed Nov 22 07:13:44 MST 2017

Matt,
  I saw a similar situation with a PBS job recently.
A process which is blocked writing to disk cannot be killed (it is in D
state, i.e. uninterruptible sleep). So the job ended but PBS logged that it
could not kill the process.
I would look in detail at the slurm logs at the point where that job is
being killed, and you might get some information.
I guess this depends on the method which Slurm uses to kill a job.
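
To check whether something like that is happening on the node, a rough sketch
along these lines could list processes that are currently in uninterruptible
sleep (reported as state D by ps and in /proc/<pid>/stat). It assumes a Linux
/proc filesystem; the function names parse_stat_state and
find_uninterruptible are made up for illustration and are not part of Slurm
or PBS.

```python
import os


def parse_stat_state(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')' instead of splitting on spaces.
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    pid = int(stat_text[:open_idx].strip())
    comm = stat_text[open_idx + 1:close_idx]
    state = stat_text[close_idx + 1:].split()[0]
    return pid, comm, state


def find_uninterruptible(proc_root="/proc"):
    """Return a list of (pid, comm) for processes currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_state(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited in the meantime, or entry unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in find_uninterruptible():
        print(pid, comm)
```

Running it on the compute node right after the job is cancelled would show
whether the leftover process is stuck in D state at that moment. A process
can pass through D state briefly during normal I/O, so a single hit is only
a hint, not proof.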

Of course this could be a completely different scenario.

On 22 November 2017 at 14:55, Matt McKinnon <matt at techsquare.com> wrote:

> Hi All,
>
> I'm wondering if you've seen this issue around, I can't seem to find
> anything on it:
>
> We have an NVIDIA DGX-1 that we run SLURM on in order to queue up jobs on
> the GPUs there, but we're running into an issue:
>
>     1) launch a SLURM job (assume job id = 12345)
>
>     2) start a program that runs on GPUs and writes continuously to disk
> (e.g., to ~/test.txt)
>
>     3) kill the process in another terminal with the command scancel 12345
>
> You would see that although the job 12345 has been killed, the file
> ~/test.txt is still being written to, and that the GPU memory taken up by
> job 12345 is still not released.
>
> Have you seen anything like this?  Trying to figure out if it's a SLURM
> issue, or a GPU issue.  We're running SLURM 17.02.7.
>
> -Matt
>
>