[slurm-users] srun : Communication connection failure

Durai Arasan arasan.durai at gmail.com
Thu Jan 20 15:06:14 UTC 2022


Hello slurm users,

I forgot to mention that an identical interactive job runs successfully on
the GPU partitions of the same cluster, so this is really puzzling.

Best,
Durai Arasan
MPI Tuebingen

On Thu, Jan 20, 2022 at 3:40 PM Durai Arasan <arasan.durai at gmail.com> wrote:

> Hello Slurm users,
>
> We are suddenly getting the errors shown below when launching interactive
> jobs on our CPU partitions. Has anyone encountered this problem before?
> Kindly let us know.
>
> [darasan84@bg-slurmb-login1 ~]$ srun --job-name "admin_test231" \
>     --ntasks=1 --nodes=1 --cpus-per-task=1 --partition=cpu-short --mem=1G \
>     --nodelist=slurm-cpu-hm-7 --time 1:00:00 --pty bash
> srun: error: Task launch for StepId=1137134.0 failed on node slurm-cpu-hm-7: Communication connection failure
> srun: error: Application launch failed: Communication connection failure
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
>
> Best regards,
> Durai Arasan
> MPI Tuebingen
>