[slurm-users] Job with srun is still RUNNING after node reboot

Tue Mar 31 11:47:05 UTC 2020

Hi,

We have an issue where, after starting an interactive job with srun
(--pty zsh) and then rebooting the node (from a different shell), srun
reports:
srun: error: eio_message_socket_accept: slurm_receive_msg[an.ip.addr.ess]:
Zero Bytes were transmitted or received
and hangs.

After the node boots, Slurm still reports the job as RUNNING, and srun
is still alive (but unresponsive).
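
For reference, a minimal sketch of how the reported state can be checked
once the node is back up, assuming the job ID is known. The helper names
job_state and is_stale_running are hypothetical; only squeue itself and
its -h, -j and -o %T options come from Slurm.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Slurm.

job_state() {
    # Print the extended state name Slurm reports for job id $1
    # (e.g. RUNNING or NODE_FAIL); prints nothing if squeue lists no such job.
    squeue -h -j "$1" -o %T
}

is_stale_running() {
    # Succeed when Slurm still reports RUNNING for job id $1.
    [ "$(job_state "$1")" = "RUNNING" ]
}
```

After the reboot, something like
is_stale_running <jobid> && echo "still reported as RUNNING"
shows whether the srun job is still listed as RUNNING.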

I've tried it with various configurations (select/linear, select/cons_tres,
jobacct_gather/linux, jobacct_gather/cgroup, task/none, task/cgroup), with
the same results. We're using 19.05.1.
Submitting the same job with sbatch instead leaves it in the more
appropriate NODE_FAIL state.

Has anyone else encountered this, or does anyone know how to keep the job
from being reported as RUNNING when it is clearly no longer running?

Thanks in advance,
    Yair.