[slurm-users] pmix and ucx(IB) testing fails with error "Cannot get polling fd"

huml1 at sugon.com huml1 at sugon.com
Sat Nov 9 15:24:16 UTC 2019


Hi,
When I was testing slurm-19.05.3 with openmpi-4.0.1 、pmix-3.1.3rc4 and ucx-1.6.1(with IB) ,I got a different error unlike Bug 7646(https://bugs.schedmd.com/show_bug.cgi?id=7646).At first , the job like "srun --mpi=pmix_v3 xxx" could run with "SLURM_PMIX_DIRECT_CONN=true" and "SLURM_PMIX_DIRECT_CONN_UCX=false",but the job was ended immediately when "SLURM_PMIX_DIRECT_CONN_UCX=true" was configured.  And I got an error like "Cannot get polling fd" after "Fail to create UCX worker: Input/output error" 
I've confirmed that the error msg comes from ucp_worker_create. so I tried to get config  after ucp_config_read("SLURM", NULL, &config) and ucp_context after ucp_init(&ucp_params, config, &ucp_context).
UCX_NET_DEVICES=all
UCX_SHM_DEVICES=all
UCX_ACC_DEVICES=all
UCX_SELF_DEVICES=all
UCX_TLS=all
UCX_ALLOC_PRIO=md:sysv,md:posix,huge,thp,md:*,mmap,heap
UCX_SOCKADDR_AUX_TLS=ud,ud_x
UCX_WARN_INVALID_CONFIG=y
UCX_BCOPY_THRESH=0
UCX_RNDV_THRESH=auto
UCX_RNDV_SEND_NBR_THRESH=256k
UCX_RNDV_THRESH_FALLBACK=inf
UCX_RNDV_PERF_DIFF=1.000
UCX_MAX_EAGER_RAILS=1
UCX_MAX_RNDV_RAILS=1
UCX_RNDV_SCHEME=auto
UCX_ZCOPY_THRESH=auto
UCX_BCOPY_BW=5800m
UCX_ATOMIC_MODE=guess
UCX_MAX_WORKER_NAME=32
UCX_USE_MT_MUTEX=n
UCX_ADAPTIVE_PROGRESS=y
UCX_SEG_SIZE=8k
UCX_TM_THRESH=1k
UCX_TM_MAX_BB_SIZE=1k
UCX_TM_FORCE_THRESH=8k
UCX_NUM_EPS=auto
UCX_RNDV_FRAG_SIZE=256k
UCX_MEMTYPE_CACHE=y
UCX_FLUSH_WORKER_EPS=y
UCX_UNIFIED_MODE=n
hello 2
UCX_NET_DEVICES=all
UCX_SHM_DEVICES=all
UCX_ACC_DEVICES=all
UCX_SELF_DEVICES=all
UCX_TLS=all
UCX_ALLOC_PRIO=md:sysv,md:posix,huge,thp,md:*,mmap,heap
UCX_SOCKADDR_AUX_TLS=ud,ud_x
UCX_WARN_INVALID_CONFIG=y
UCX_BCOPY_THRESH=0
UCX_RNDV_THRESH=auto
UCX_RNDV_SEND_NBR_THRESH=256k
UCX_RNDV_THRESH_FALLBACK=inf
UCX_RNDV_PERF_DIFF=1.000
UCX_MAX_EAGER_RAILS=1
UCX_MAX_RNDV_RAILS=1
UCX_RNDV_SCHEME=auto
UCX_ZCOPY_THRESH=auto
UCX_BCOPY_BW=5800m
UCX_ATOMIC_MODE=guess
UCX_MAX_WORKER_NAME=32
UCX_USE_MT_MUTEX=n
UCX_ADAPTIVE_PROGRESS=y
UCX_SEG_SIZE=8k
UCX_TM_THRESH=1k
UCX_TM_MAX_BB_SIZE=1k
UCX_TM_FORCE_THRESH=8k
UCX_NUM_EPS=auto
UCX_RNDV_FRAG_SIZE=256k
UCX_MEMTYPE_CACHE=y
UCX_FLUSH_WORKER_EPS=y
UCX_UNIFIED_MODE=n
#
# UCP context
#
#            md 0  :  self
#            md 1  :  tcp
#            md 2  :  ib/mlx5_3
#            md 3  :  ib/mlx5_2
#            md 4  :  ib/mlx5_1
#            md 5  :  ib/mlx5_0
#            md 6  :  rdmacm
#            md 7  :  sysv
#            md 8  :  posix
#            md 9  :  cma
#            md 10 :  knem
#
#      resource 0  :  md 0  dev 0  flags -- self/self
#      resource 1  :  md 1  dev 1  flags -- tcp/ib0
#      resource 2  :  md 1  dev 2  flags -- tcp/eno1
#      resource 3  :  md 2  dev 3  flags -- rc/mlx5_3:1
#      resource 4  :  md 2  dev 3  flags -- rc_mlx5/mlx5_3:1
#      resource 5  :  md 2  dev 3  flags -- dc/mlx5_3:1
#      resource 6  :  md 2  dev 3  flags -- dc_mlx5/mlx5_3:1
#      resource 7  :  md 2  dev 3  flags -- ud/mlx5_3:1
#      resource 8  :  md 2  dev 3  flags -- ud_mlx5/mlx5_3:1
#      resource 9  :  md 2  dev 3  flags -- cm/mlx5_3:1
#      resource 10 :  md 3  dev 4  flags -- rc/mlx5_2:1
#      resource 11 :  md 3  dev 4  flags -- rc_mlx5/mlx5_2:1
#      resource 12 :  md 3  dev 4  flags -- dc/mlx5_2:1
#      resource 13 :  md 3  dev 4  flags -- dc_mlx5/mlx5_2:1
#      resource 14 :  md 3  dev 4  flags -- ud/mlx5_2:1
#      resource 15 :  md 3  dev 4  flags -- ud_mlx5/mlx5_2:1
#      resource 16 :  md 3  dev 4  flags -- cm/mlx5_2:1
#      resource 17 :  md 4  dev 5  flags -- rc/mlx5_1:1
#      resource 18 :  md 4  dev 5  flags -- rc_mlx5/mlx5_1:1
#      resource 19 :  md 4  dev 5  flags -- dc/mlx5_1:1
#      resource 20 :  md 4  dev 5  flags -- dc_mlx5/mlx5_1:1
#      resource 21 :  md 4  dev 5  flags -- ud/mlx5_1:1
#      resource 22 :  md 4  dev 5  flags -- ud_mlx5/mlx5_1:1
#      resource 23 :  md 4  dev 5  flags -- cm/mlx5_1:1
#      resource 24 :  md 5  dev 6  flags -- rc/mlx5_0:1
#      resource 25 :  md 5  dev 6  flags -- rc_mlx5/mlx5_0:1
#      resource 26 :  md 5  dev 6  flags -- dc/mlx5_0:1
#      resource 27 :  md 5  dev 6  flags -- dc_mlx5/mlx5_0:1
#      resource 28 :  md 5  dev 6  flags -- ud/mlx5_0:1
#      resource 29 :  md 5  dev 6  flags -- ud_mlx5/mlx5_0:1
#      resource 30 :  md 5  dev 6  flags -- cm/mlx5_0:1
#      resource 31 :  md 6  dev 7  flags -s rdmacm/sockaddr
#      resource 32 :  md 7  dev 8  flags -- mm/sysv
#      resource 33 :  md 8  dev 9  flags -- mm/posix
#      resource 34 :  md 9  dev 10 flags -- cma/cma
#      resource 35 :  md 10 dev 11 flags -- knem/knem
#

Looking forward to your reply.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20191109/acc42e9a/attachment.htm>


More information about the slurm-users mailing list