[slurm-users] srun : Communication connection failure

Ryan Novosielski novosirj at rutgers.edu
Wed Jan 26 00:26:59 UTC 2022


I’m coming to this question late, and this is not the answer to your problem (well, maybe tangentially), but it may help someone else: my recollection is that, for interactive jobs, the compute node that gets assigned the job must be able to contact the node you’re starting the job from (so bg-slurmb-login1 here) on a wide variety of ports. We had a firewall config that didn’t allow for that, and all interactive jobs failed until we resolved it. I guess having the wrong address someplace could mimic that behavior.
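
One rough way to test for this kind of firewall problem is to run a small TCP probe on the compute node, pointed back at the login node. The sketch below is only an illustration, not anything Slurm ships: the function names are made up here, and it assumes you already know which ports to try (for example because SrunPortRange is set in slurm.conf, which confines srun's listening ports to that range). A result of "open" or "refused" means packets are at least reaching the host, while "timeout" is more typical of a firewall silently dropping them.

```python
import socket


def probe(host, port, timeout=3.0):
    """Try one TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error" (e.g. name lookup failed).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all; often a sign of a firewall dropping packets.
        return "timeout"
    except OSError:
        # Anything else: unresolvable host name, unreachable network, etc.
        return "error"


def probe_range(host, ports, timeout=3.0):
    """Probe several ports and return a {port: outcome} dictionary."""
    return {port: probe(host, port, timeout) for port in ports}


# Example call, to be run on the compute node. The port range here is a
# hypothetical placeholder; substitute whatever your site actually uses:
#   probe_range("bg-slurmb-login1", range(60001, 60011))
```

Keep in mind that a port only shows up as "open" while something is actually listening on it, so "refused" on an idle login node is not necessarily a bad sign; a consistent "timeout" is the result that points toward the firewall.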

--
#BlackLivesMatter
____
|| \\UTGERS,  	 |---------------------------*O*---------------------------
||_// the State	 |         Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ	 | Office of Advanced Research Computing - MSB C630, Newark
     `'

> On Jan 20, 2022, at 9:40 AM, Durai Arasan <arasan.durai at gmail.com> wrote:
> 
> Hello Slurm users,
> 
> We are suddenly encountering strange errors while trying to launch interactive jobs on our cpu partitions. Have you encountered this problem before? Kindly let us know.
> 
> [darasan84@bg-slurmb-login1 ~]$ srun --job-name "admin_test231" --ntasks=1 --nodes=1 --cpus-per-task=1 --partition=cpu-short --mem=1G --nodelist=slurm-cpu-hm-7 --time 1:00:00 --pty bash
> srun: error: Task launch for StepId=1137134.0 failed on node slurm-cpu-hm-7: Communication connection failure
> srun: error: Application launch failed: Communication connection failure
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
> 
> Best regards,
> Durai Arasan
> MPI Tuebingen


