[slurm-users] [External] Re: srun : Communication connection failure
mrobbert at mines.edu
Thu Jan 20 16:05:32 UTC 2022
This looks like a network problem, possibly DNS. Can you ping the host involved, and does its hostname resolve?
What does slurmctld.log say? How about slurmd.log on the node in question?
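For the ping and DNS checks, something like the following could be run from the login node. This is only a rough sketch: `resolves` and `check_node` are helper names I made up, it assumes a Linux host with `getent` and the iputils `ping`, and it says nothing about whether the Slurm ports themselves are reachable.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Slurm) for a quick name-resolution
# and reachability check of one compute node.

# Succeeds if the name resolves through the system resolver
# (getent consults /etc/hosts as well as DNS, per nsswitch.conf).
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# Returns 0 if the node resolves and answers a ping,
# 1 if name resolution fails, 2 if the ping fails.
check_node() {
    node="$1"
    if resolves "$node"; then
        echo "$node: resolves to $(getent hosts "$node" | awk '{print $1; exit}')"
    else
        echo "$node: name resolution FAILED"
        return 1
    fi
    # One ICMP echo request; -W 2 is a 2-second timeout on Linux iputils ping.
    if ping -c 1 -W 2 "$node" >/dev/null 2>&1; then
        echo "$node: ping ok"
    else
        echo "$node: ping FAILED"
        return 2
    fi
}

# Example call, using the node name from the srun command in this thread:
#   check_node slurm-cpu-hm-7
```

If the check passes on the login node, it may be worth repeating it in the other direction (from slurm-cpu-hm-7 back to the login node), since a successful ping one way does not rule out a firewall or routing problem the other way.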
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Durai Arasan <arasan.durai at gmail.com>
Date: Thursday, January 20, 2022 at 08:08
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: [External] Re: [slurm-users] srun : Communication connection failure
Hello slurm users,
I forgot to mention that an identical interactive job works successfully on the gpu partitions (in the same cluster). So this is really puzzling.
On Thu, Jan 20, 2022 at 3:40 PM Durai Arasan <arasan.durai at gmail.com> wrote:
Hello Slurm users,
We are suddenly encountering strange errors while trying to launch interactive jobs on our cpu partitions. Have you encountered this problem before? Kindly let us know.
[darasan84@bg-slurmb-login1 ~]$ srun --job-name "admin_test231" --ntasks=1 --nodes=1 --cpus-per-task=1 --partition=cpu-short --mem=1G --nodelist=slurm-cpu-hm-7 --time 1:00:00 --pty bash
srun: error: Task launch for StepId=1137134.0 failed on node slurm-cpu-hm-7: Communication connection failure
srun: error: Application launch failed: Communication connection failure
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete