[slurm-users] lots of job failed due to node failure
肖正刚
guru.novice at gmail.com
Wed Jul 22 23:47:02 UTC 2020
We checked slurmd.log and found "error: service_connection:
slurm_receive_msg: Socket timed out on send/recv operation" at the times
the jobs failed, so maybe this is the reason?
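
To put a number on how long each node was unresponsive, the slurmctld.log
lines quoted further down can be parsed with a short script. This is only a
rough sketch: the regular expression, the helper names (expand,
unresponsive_intervals) and the simplified hostlist handling (comma lists
only, no ranges) are my own assumptions and not part of Slurm.

```python
import re
from datetime import datetime

# Sample lines copied from the slurmctld.log excerpt quoted in this thread.
LOG = """\
[2020-07-20T00:21:23.306] error: Nodes j[1608,1802] not responding
[2020-07-20T00:22:27.486] error: Nodes j1608 not responding, setting DOWN
[2020-07-20T00:26:23.725] error: Nodes j1802 not responding
[2020-07-20T00:26:27.323] error: Nodes j1802 not responding, setting DOWN
[2020-07-20T00:26:46.602] Node j1608 now responding
[2020-07-20T00:26:49.449] Node j1802 now responding
"""

# Assumed line shape, based only on the excerpt above.
LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?:error: )?Nodes? (?P<nodes>\S+) "
    r"(?P<event>not responding|now responding)"
)


def expand(nodes):
    """Expand a simple hostlist such as 'j[1608,1802]' (comma lists only)."""
    m = re.fullmatch(r"(\w+?)\[([\d,]+)\]", nodes)
    if not m:
        return [nodes]
    return [m.group(1) + n for n in m.group(2).split(",")]


def unresponsive_intervals(text):
    """Return {node: seconds} from the first 'not responding' line
    to the following 'now responding' line for that node."""
    first_seen = {}  # node -> timestamp of first "not responding"
    result = {}      # node -> seconds until "now responding"
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
        for node in expand(m.group("nodes")):
            if m.group("event") == "not responding":
                first_seen.setdefault(node, ts)
            elif node in first_seen:
                result[node] = (ts - first_seen.pop(node)).total_seconds()
    return result


if __name__ == "__main__":
    for node, secs in sorted(unresponsive_intervals(LOG).items()):
        print(f"{node}: unresponsive for {secs:.1f} s")
```

Comparing these intervals with the timestamps of the "Socket timed out"
errors in slurmd.log on the same nodes may show whether the two line up.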
Sarlo, Jeffrey S <JSarlo at central.uh.edu> wrote on Wed, Jul 22, 2020 at 9:52 PM:
> OK.
>
> Though it does look like both were down for around 5 minutes
>
> [2020-07-20T00:21:23.306] error: Nodes j[1608,1802] not responding
> [2020-07-20T00:26:46.602] Node j1608 now responding
> [2020-07-20T00:26:49.449] Node j1802 now responding
>
>
> You might want to check the slurmd.log file on the compute nodes
> themselves and see if there is more information there.
>
> ------------------------------
> *From:* 肖正刚 <guru.novice at gmail.com>
> *Sent:* Wednesday, July 22, 2020 8:46 AM
> *To:* Sarlo, Jeffrey S <JSarlo at Central.UH.EDU>
> *Subject:* Re: [slurm-users] lots of job failed due to node failure
>
> The nodes did not reboot or crash.
> And from the log you can see that node j1802 resumed within one minute.
>
> Sarlo, Jeffrey S <JSarlo at central.uh.edu> wrote on Wed, Jul 22, 2020 at 7:58 PM:
>
> If you log into a node after you see that, had the node rebooted/crashed?
> Maybe a job is crashing the node or there is a hardware issue with the node.
>
> Jeff
>
> ------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> 肖正刚 <guru.novice at gmail.com>
> *Sent:* Tuesday, July 21, 2020 7:40 PM
> *To:* slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
> *Subject:* [slurm-users] lots of job failed due to node failure
>
> Hi all,
> We run Slurm 19.05 on a cluster of about 1k nodes. Recently we found that
> lots of jobs failed due to node failure. Checking slurmctld.log, we found
> that nodes are set to the DOWN state and then resume quickly.
> Some log info:
> [2020-07-20T00:21:23.306] error: Nodes j[1608,1802] not responding
> [2020-07-20T00:22:27.486] error: Nodes j1608 not responding, setting DOWN
> [2020-07-20T00:26:23.725] error: Nodes j1802 not responding
> [2020-07-20T00:26:27.323] error: Nodes j1802 not responding, setting DOWN
> [2020-07-20T00:26:46.602] Node j1608 now responding
> [2020-07-20T00:26:49.449] Node j1802 now responding
>
> Has anyone hit this issue before?
> Any suggestions will help.
>
> Regards.
>
>