[slurm-users] Pending job will be cancelled since transport endpoint is not connected

Chenyang Yan memory.yancy at gmail.com
Fri Mar 31 07:10:40 UTC 2023


Hi, all

We have a cluster running Slurm 20.11.8 on CentOS 8.2. It has suddenly
started showing a weird problem: periodically, *only pending jobs* are
cancelled with a "Transport endpoint is not connected" error (see image
https://user-images.githubusercontent.com/19144683/229037078-ca704ba8-23a4-4948-9d1a-bacab82acd1f.png).
All jobs are submitted with the srun command.
... ...
srun: job 6367724 queued and waiting for resources
srun: error: Unable to allocate resources: Transport endpoint is not connected
srun: job 6367725 queued and waiting for resources
srun: error: Unable to allocate resources: Transport endpoint is not connected
srun: job 6367726 queued and waiting for resources
srun: job 6367727 queued and waiting for resources
srun: job 6367728 queued and waiting for resources
srun: error: Unable to allocate resources: Transport endpoint is not connected
srun: Force Terminated job 6366908

[root@slurm-master01 bin]# journalctl --since today -p err _COMM=slurmctld
Mar 31 02:50:46 slurm-master01 slurmctld[220654]: error: slurm_receive_msgs: Transport endpoint is not connected
Mar 31 02:50:47 slurm-master01 slurmctld[220654]: error: slurm_receive_msgs: Transport endpoint is not connected

According to
https://github.com/SchedMD/slurm/blob/slurm-20-11-8-1/src/srun/libsrun/allocate.c#L182-L227
, this looks like an OS issue? I googled "transport endpoint is not connected",
and many references attribute it to a filesystem I/O issue. So:
* How can we prevent pending jobs from being cancelled in Slurm?
* What caused the error reported by slurmctld?
Thanks!