【发布时间】:2021-01-24 14:58:07
【问题描述】:
我正在使用 OpenMPI,我需要启用 CUDA 感知 MPI。与 MPI 一起,我将 OpenACC 与 hpc_sdk 软件一起使用。
在https://www.open-mpi.org/faq/?category=buildcuda之后,我下载并安装了UCX(不是gdrcopy,我还没有安装)
./contrib/configure-release --with-cuda=/opt/nvidia/hpc_sdk/Linux_x86_64/20.7/cuda/11.0 CC=pgcc CXX=pgc++ --disable-fortran
然后打印出来:
checking cuda.h usability... yes
checking cuda.h presence... yes
checking for cuda.h... yes
checking cuda_runtime.h usability... yes
checking cuda_runtime.h presence... yes
checking for cuda_runtime.h... yes
所以 UCX 似乎没问题。 在此之后,我重新配置了 OpenMPI:
./configure --with-ucx=/home/marco/Downloads/ucx-1.9.0/install --with-cuda=/opt/nvidia/hpc_sdk/Linux_x86_64/20.7/cuda/11.0 CC =pgcc CXX=pgc++ --disable-mpi-fortran
然后打印出来:
CUDA support: yes
Open UCX: yes
如果我尝试使用以下命令运行应用程序:mpirun -np 2 -mca pml ucx -x ./a.out(如 openucx.org 上的建议),我会收到以下错误:
match_arg (utils/args/args.c:163): unrecognized argument mca
HYDU_parse_array (utils/args/args.c:178): argument matching returned error
parse_args (ui/mpich/utils.c:1642): error parsing input array
HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1694): unable to parse user arguments
main (ui/mpich/mpiexec.c:148): error parsing parameters
我看到编译器正在寻找的目录不是 OpenMPI 的目录,而是 MPICH 的目录,我不知道为什么。如果我输入 which mpicc ,which mpiexec 和 which mpirun 我会得到 OpenMPI 的。
如果我运行:mpiexec -n 2 ./a.out 我得到:
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
已编辑:
做同样的事情,但使用 NVIDIA HPC SDK 附带的 OpenMPI-4.0.5 编译和运行良好。
我明白了:
[marco-Inspiron-7501:1356251:0:1356251] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f05cfafa000)
==== backtrace (tid:1356251) ====
0 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(ucs_handle_error+0x67) [0x7f060ae06dc7]
1 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ab87) [0x7f060ae06b87]
2 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ace4) [0x7f060ae06ce4]
3 /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f060c7433c0]
4 /lib/x86_64-linux-gnu/libc.so.6(+0x18e885) [0x7f060befb885]
5 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x379e6) [0x7f060b2bd9e6]
6 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_dt_pack+0xa5) [0x7f060b2bd775]
7 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d5b5) [0x7f060b2d35b5]
8 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b11d) [0x7f060b2d111d]
9 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(+0x1b577) [0x7f060b055577]
10 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x75) [0x7f060b054725]
11 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d614) [0x7f060b2d3614]
12 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4c2c7) [0x7f060b2d22c7]
13 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b5b1) [0x7f060b2d15b1]
14 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x625bd) [0x7f060b2e85bd]
15 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x61d15) [0x7f060b2e7d15]
16 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x6121a) [0x7f060b2e721a]
17 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_tag_send_nbx+0x5ec) [0x7f060b2e65ac]
18 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libmpi.so.40(mca_pml_ucx_send+0x1a3) [0x7f060dfc3b33]
=================================
[marco-Inspiron-7501:1356252:0:1356252] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fd7f7afa000)
==== backtrace (tid:1356252) ====
0 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(ucs_handle_error+0x67) [0x7fd82a711dc7]
1 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ab87) [0x7fd82a711b87]
2 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ace4) [0x7fd82a711ce4]
3 /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7fd82c04e3c0]
4 /lib/x86_64-linux-gnu/libc.so.6(+0x18e885) [0x7fd82b806885]
5 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x379e6) [0x7fd82abc89e6]
6 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_dt_pack+0xa5) [0x7fd82abc8775]
7 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d5b5) [0x7fd82abde5b5]
8 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b11d) [0x7fd82abdc11d]
9 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(+0x1b577) [0x7fd82a960577]
10 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x75) [0x7fd82a95f725]
11 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d614) [0x7fd82abde614]
12 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4c2c7) [0x7fd82abdd2c7]
13 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b5b1) [0x7fd82abdc5b1]
14 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x625bd) [0x7fd82abf35bd]
15 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x61d15) [0x7fd82abf2d15]
16 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x6121a) [0x7fd82abf221a]
17 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_tag_send_nbx+0x5ec) [0x7fd82abf15ac]
18 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libmpi.so.40(mca_pml_ucx_send+0x1a3) [0x7fd82d8ceb33]
=================================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node marco-Inspiron-7501 exited on signal 11 (Segmentation fault).
由pragma acc host_data use_device(send_buf, recv_buf)引起的错误
double send_buf[NX_GLOB + 2*NGHOST];
double recv_buf[NX_GLOB + 2*NGHOST];
#pragma acc enter data create(send_buf[:NX_GLOB+2*NGHOST], recv_buf[NX_GLOB+2*NGHOST])
// Top buffer
j = jend;
#pragma acc parallel loop present(phi[:ny_tot][:nx_tot], send_buf[:NX_GLOB+2*NGHOST])
for (i = ibeg; i <= iend; i++) send_buf[i] = phi[j][i];
#pragma acc host_data use_device(send_buf, recv_buf)
{
MPI_Sendrecv (send_buf, iend+1, MPI_DOUBLE, procR[1], 0,
recv_buf, iend+1, MPI_DOUBLE, procR[1], 0,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
【问题讨论】: