[slurm-users] Job with srun is still RUNNING after node reboot
irush at cs.huji.ac.il
Wed Apr 1 08:05:20 UTC 2020
I've checked it now, it isn't listed as a runaway job.
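For reference, this is roughly how I check (a sketch using standard Slurm admin tooling; the job id 12345 below is just a placeholder):

```shell
# sacctmgr compares the jobs slurmctld thinks are running against the
# accounting database and offers to fix any orphaned ("runaway") entries.
sacctmgr show runawayjobs

# Inspect the stuck job's state as the controller currently sees it
# (12345 is a placeholder job id):
scontrol show job 12345 | grep -o 'JobState=[A-Z_]*'
```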
On Tue, Mar 31, 2020 at 5:24 PM David Rhey <drhey at umich.edu> wrote:
> Hi, Yair,
> Out of curiosity have you checked to see if this is a runaway job?
> On Tue, Mar 31, 2020 at 7:49 AM Yair Yarom <irush at cs.huji.ac.il> wrote:
>> We have an issue where, after running srun (with --pty zsh) and then
>> rebooting the node (from a different shell), srun reports:
>> srun: error: eio_message_socket_accept:
>> slurm_receive_msg[an.ip.addr.ess]: Zero Bytes were transmitted or received
>> and hangs.
>> After the node comes back up, Slurm still claims the job is RUNNING, and
>> srun is still alive (but unresponsive).
>> I've tried it with various configurations (select/linear,
>> select/cons_tres, jobacct_gather/linux, jobacct_gather/cgroup, task/none,
>> task/cgroup), with the same results. We're running Slurm 19.05.1.
>> Running with sbatch causes the job to be in the more appropriate
>> NODE_FAIL state instead.
>> Has anyone else encountered this? Or does anyone know how to get the job
>> out of the RUNNING state once it's clearly no longer running?
>> Thanks in advance,
> David Rhey
> Advanced Research Computing - Technology Services
> University of Michigan