[slurm-users] [External] Re: srun : Communication connection failure

Durai Arasan arasan.durai at gmail.com
Tue Jan 25 13:41:31 UTC 2022


Hello Mike, Doug:

The issue has been resolved. My colleagues say the addresses in
slurm.conf on the login nodes were incorrect. It could also have been a
temporary network issue.
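
For anyone hitting the same symptom, here is a minimal sketch of the kind of
address entries that have to be consistent in slurm.conf on every login node
(hostnames and IPs below are placeholders, not our real values):

  SlurmctldHost=slurm-master(10.0.0.10)
  NodeName=slurm-cpu-hm-7 NodeAddr=10.0.1.7 CPUs=64 RealMemory=512000 State=UNKNOWN

A quick way to see which controller address a login node is actually using:

  scontrol show config | grep -i slurmctldhost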

Best,
Durai Arasan
MPI Tübingen

On Fri, Jan 21, 2022 at 2:15 PM Doug Meyer <dameyer99 at gmail.com> wrote:

> Hi,
> Did you recently add nodes?  We have seen that when we add nodes beyond the
> TreeWidth count, the most recently added nodes lose communication (an
> asterisk appears next to the node state in sinfo). We have to ensure that the
> TreeWidth declaration in slurm.conf matches or exceeds the number of nodes.
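>
> For example, a sketch only (600 here is a made-up node count, not a
> recommendation), in slurm.conf:
>
>   TreeWidth=600
>
> You can check the value currently in effect with:
>
>   scontrol show config | grep -i treewidth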
>
> Doug
>
> On Fri, Jan 21, 2022 at 4:33 AM Durai Arasan <arasan.durai at gmail.com>
> wrote:
>
>> Hello Mike,
>>
>> I am able to ping the nodes from the slurm master without any problem.
>> There is really nothing of interest in slurmctld.log or slurmd.log, which is
>> why I posted here.
>>
>> Best,
>> Durai Arasan
>> MPI Tuebingen
>>
>> On Thu, Jan 20, 2022 at 5:08 PM Michael Robbert <mrobbert at mines.edu>
>> wrote:
>>
>>> It looks like some kind of network problem, but it could also be DNS.
>>> Can you ping and resolve DNS for the hosts involved?
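>>>
>>> For example, something like the following from the controller and from the
>>> login node (node name taken from your srun command):
>>>
>>>   ping -c 3 slurm-cpu-hm-7
>>>   getent hosts slurm-cpu-hm-7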
>>>
>>> What does slurmctld.log say? How about slurmd.log on the node in
>>> question?
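>>>
>>> If you are not sure where those logs are written on your install, something
>>> like this should show the configured paths (the path below is only an
>>> example; many sites log elsewhere or to syslog):
>>>
>>>   scontrol show config | grep -i logfile
>>>   tail -n 100 /var/log/slurm/slurmctld.log
>>>
>>> and the same for slurmd.log on slurm-cpu-hm-7.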
>>>
>>>
>>>
>>> Mike
>>>
>>>
>>>
>>> From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Durai Arasan <arasan.durai at gmail.com>
>>> Date: Thursday, January 20, 2022 at 08:08
>>> To: Slurm User Community List <slurm-users at lists.schedmd.com>
>>> Subject: [External] Re: [slurm-users] srun : Communication connection failure
>>>
>>>
>>>
>>>
>>> Hello slurm users,
>>>
>>>
>>>
>>> I forgot to mention that an identical interactive job works successfully
>>> on the gpu partitions (in the same cluster). So this is really puzzling.
>>>
>>>
>>>
>>> Best,
>>>
>>> Durai Arasan
>>>
>>> MPI Tuebingen
>>>
>>>
>>>
>>> On Thu, Jan 20, 2022 at 3:40 PM Durai Arasan <arasan.durai at gmail.com>
>>> wrote:
>>>
>>> Hello Slurm users,
>>>
>>>
>>>
>>> We are suddenly encountering strange errors while trying to launch
>>> interactive jobs on our cpu partitions. Have you encountered this problem
>>> before? Kindly let us know.
>>>
>>>
>>>
>>> [darasan84@bg-slurmb-login1 ~]$ srun --job-name "admin_test231" --ntasks=1 --nodes=1 --cpus-per-task=1 --partition=cpu-short --mem=1G --nodelist=slurm-cpu-hm-7 --time 1:00:00 --pty bash
>>> srun: error: Task launch for StepId=1137134.0 failed on node slurm-cpu-hm-7: Communication connection failure
>>> srun: error: Application launch failed: Communication connection failure
>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>> srun: error: Timed out waiting for job step to complete
>>>
>>>
>>>
>>> Best regards,
>>>
>>> Durai Arasan
>>>
>>> MPI Tuebingen
>>>
>>>