[slurm-users] Job with srun is still RUNNING after node reboot

Tue Mar 31 14:19:25 UTC 2020

Hi, Yair,

Out of curiosity, have you checked whether this is a runaway job?
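
In case it helps, here is a minimal sketch of how one might check, assuming accounting through slurmdbd is configured and the Slurm client tools are installed on the host. The function name check_runaway_jobs is just an illustrative helper, not part of Slurm; the only Slurm call it makes is sacctmgr show runawayjobs, which lists jobs the accounting database still records as active although the controller no longer tracks them.

```shell
# Hypothetical helper (not part of Slurm): list runaway jobs, if any.
# Assumes slurmdbd accounting is configured for the cluster.
check_runaway_jobs() {
    # Bail out early if the Slurm accounting client is not installed here.
    if ! command -v sacctmgr >/dev/null 2>&1; then
        echo "sacctmgr not found; run this on a host with the Slurm client tools" >&2
        return 127
    fi
    # Lists runaway jobs; when it finds some, sacctmgr asks whether to fix them.
    sacctmgr show runawayjobs
}
```

If the job from your report does not show up there, the controller itself still believes it is running, and it is probably not a runaway job in the accounting sense.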

David

On Tue, Mar 31, 2020 at 7:49 AM Yair Yarom <irush at cs.huji.ac.il> wrote:

> Hi,
>
> We have an issue: when we run srun (with --pty zsh) and then reboot the
> node (from a different shell), srun reports:
> srun: error: eio_message_socket_accept: slurm_receive_msg[an.ip.addr.ess]:
> Zero Bytes were transmitted or received
> and hangs.
>
> After the node boots, Slurm claims that the job is still RUNNING, and srun
> is still alive (but not responsive).
>
> I've tried it with various configurations (select/linear,
> select/cons_tres, jobacct_gather/linux, jobacct_gather/cgroup, task/none,
> task/cgroup), with the same results. We're using 19.05.1.
> Running with sbatch causes the job to be in the more appropriate NODE_FAIL
> state instead.
>
> Has anyone else encountered this, or does anyone know how to keep the job
> from being reported as RUNNING when it's clearly not running?
>
> Thanks in advance,
>     Yair.
>
>

-- 
David Rhey
---------------
Advanced Research Computing - Technology Services
University of Michigan