[slurm-users] [External] Re: srun : Communication connection failure
Durai Arasan
arasan.durai at gmail.com
Fri Jan 21 11:30:31 UTC 2022
Hello Mike,
I am able to ping the nodes from the slurm master without any problem.
Actually there is nothing interesting in slurmctld.log or slurmd.log. You
can trust me on this. That is why I posted here.
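
Since ping only exercises ICMP, a separate check of name resolution and TCP reachability may help narrow this down. The following is a minimal sketch, not a definitive diagnostic: the helper name check_node is made up for illustration, and port 6818 is only the Slurm default for SlurmdPort, so it is an assumption that this cluster has not changed it in slurm.conf. It would need to be run from the login node where srun was started, not only from the master.

```python
import socket


def check_node(host, port=6818, timeout=3.0):
    """Resolve `host` and try a TCP connection to `port`.

    Returns (ip, tcp_ok). `ip` is None if name resolution fails.
    6818 is assumed to be the slurmd port (the Slurm default);
    adjust it if SlurmdPort is set differently in slurm.conf.
    """
    try:
        ip = socket.gethostbyname(host)  # DNS / hosts-file lookup
    except OSError:
        return None, False
    try:
        # A successful connect only shows the port is reachable,
        # not that slurmd itself is healthy.
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:
        return ip, False


# Example (hypothetical invocation, using the node from the srun command):
# print(check_node("slurm-cpu-hm-7"))
```

If the name resolves but the TCP connection fails while ping succeeds, a firewall rule or a stopped slurmd on that node would be a plausible next thing to look at.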
Best,
Durai Arasan
MPI Tuebingen
On Thu, Jan 20, 2022 at 5:08 PM Michael Robbert <mrobbert at mines.edu> wrote:
> It looks like it could be some kind of network problem but could be DNS.
> Can you ping and do DNS resolution for the host involved?
>
> What does slurmctld.log say? How about slurmd.log on the node in question?
>
>
>
> Mike
>
>
>
> *From: *slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Durai Arasan <arasan.durai at gmail.com>
> *Date: *Thursday, January 20, 2022 at 08:08
> *To: *Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject: *[External] Re: [slurm-users] srun : Communication connection
> failure
>
>
>
> Hello slurm users,
>
>
>
> I forgot to mention that an identical interactive job works successfully
> on the gpu partitions (in the same cluster). So this is really puzzling.
>
>
>
> Best,
>
> Durai Arasan
>
> MPI Tuebingen
>
>
>
> On Thu, Jan 20, 2022 at 3:40 PM Durai Arasan <arasan.durai at gmail.com>
> wrote:
>
> Hello Slurm users,
>
>
>
> We are suddenly encountering strange errors while trying to launch
> interactive jobs on our cpu partitions. Have you encountered this problem
> before? Kindly let us know.
>
>
>
> [darasan84 at bg-slurmb-login1 ~]$ srun --job-name "admin_test231"
> --ntasks=1 --nodes=1 --cpus-per-task=1 --partition=cpu-short --mem=1G
> --nodelist=slurm-cpu-hm-7 --time 1:00:00 --pty bash
> srun: error: Task launch for StepId=1137134.0 failed on node
> slurm-cpu-hm-7: Communication connection failure
> srun: error: Application launch failed: Communication connection failure
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
>
>
>
> Best regards,
>
> Durai Arasan
>
> MPI Tuebingen
>
>