【Posted】: 2018-11-05 07:18:51
【Question】:
I am trying to run OpenMPI code on an NVIDIA Jetson TX2, but mpiexec fails with an OPAL error.
Compile command:
$ nvcc -I/home/user/.openmpi/include/ -L/home/user/.openmpi/lib/ -lmpi -std=c++11 *.cu *.cpp -o program
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
Error output at run time:
$ mpiexec -np 4 ./program
[user:05728] OPAL ERROR: Not initialized in file pmix2x_client.c at line 109
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[user:05728] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[user:05729] OPAL ERROR: Not initialized in file pmix2x_client.c at line 109
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[user:05729] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[7361,1],0]
Exit code: 1
--------------------------------------------------------------------------
I installed OpenMPI 3.1.2 with the following commands:
$ ./configure --prefix="/home/user/.openmpi" --with-cuda
$ make; sudo make install
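I also set my $PATH and $LD_LIBRARY_PATH variables accordingly, following the instructions in this link.
I can run the program successfully on my laptop (Intel i7). While looking up the error, I found several posts suggesting that I reinstall OpenMPI. I have tried that multiple times (including re-downloading the library), without success.
Any help would be greatly appreciated!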
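One thing worth ruling out here is that the mpiexec found on $PATH belongs to a different MPI installation than the one under /home/user/.openmpi, since a launcher and a library from different installations can make MPI_Init fail. The sketch below only reports where the tools resolve. It assumes the prefix from the configure line above, and under_prefix is a hypothetical helper name, not part of Open MPI.

```shell
#!/bin/sh
# Assumed install prefix, taken from the configure line in the question.
MPI_PREFIX="${MPI_PREFIX:-/home/user/.openmpi}"

# under_prefix PATHNAME PREFIX: succeeds if PATHNAME lies inside PREFIX.
# (hypothetical helper; does not resolve symlinks)
under_prefix() {
    case "$1" in
        "$2"/*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Report where each MPI tool on PATH actually resolves to.
for tool in mpiexec mpic++; do
    found=$(command -v "$tool" 2>/dev/null || true)
    if [ -z "$found" ]; then
        echo "$tool: not found on PATH"
    elif under_prefix "$found" "$MPI_PREFIX"; then
        echo "$tool: $found (inside $MPI_PREFIX)"
    else
        echo "$tool: $found (NOT inside $MPI_PREFIX)"
    fi
done
```

If either tool is reported outside the prefix, the $PATH order would be the first thing to revisit.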
Edit
As requested in the comments, I tried running the following minimal code (main.cpp):
#include <iostream>
#include "mpi.h"
#include <string>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    std::cout << rank << '\n';
    MPI_Finalize();
    return 0;
}
To compile it, I reran the earlier command and got the same error:
$ nvcc -I/home/user/.openmpi/include/ -L/home/user/.openmpi/lib/ -lmpi -std=c++11 main.cpp -o program
But if I compile it with mpic++, it runs fine.
$ mpic++ main.cpp -o ./program
$ mpiexec -np 4 ./program
0
1
3
2
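Since the mpic++ build works and the hand-written nvcc line does not, it may help to compare the flags the wrapper actually uses. Open MPI's wrapper compilers print them with --showme:compile and --showme:link. The sketch below rewrites those flags into forms nvcc accepts and prints a candidate command without running it. to_nvcc_flags is a hypothetical helper, and the rewriting rules cover only -Wl,... and -pthread, which is an assumption about what the wrapper emits on this system.

```shell
#!/bin/sh
# to_nvcc_flags FLAG...: rewrite host-compiler flags into forms nvcc accepts.
# (hypothetical helper, not part of Open MPI or CUDA; breaks on paths with spaces)
to_nvcc_flags() {
    out=""
    for flag in "$@"; do
        case "$flag" in
            -Wl,*)    out="$out -Xlinker ${flag#-Wl,}" ;;  # forward to the host linker
            -pthread) out="$out -Xcompiler -pthread" ;;    # forward to the host compiler
            *)        out="$out $flag" ;;                  # -I, -L, -l pass through
        esac
    done
    printf '%s\n' "${out# }"
}

if command -v mpic++ >/dev/null 2>&1 && command -v nvcc >/dev/null 2>&1; then
    # The unquoted substitutions are split into words on purpose.
    MPI_COMPILE=$(to_nvcc_flags $(mpic++ --showme:compile))
    MPI_LINK=$(to_nvcc_flags $(mpic++ --showme:link))
    # Print the candidate command for inspection instead of running it.
    echo nvcc -std=c++11 $MPI_COMPILE main.cpp -o program $MPI_LINK
else
    echo "mpic++ or nvcc not found on PATH; nothing to print"
fi
```

Any rpath or extra library in the printed line that is missing from the original nvcc command is a candidate explanation for the difference. Another option sometimes used is nvcc -ccbin mpic++, which asks nvcc to use the wrapper as its host compiler.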
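【Comments】:
- Can you run mpiexec -n 4 hello_c? The source is in examples/hello_c.c
- What are you doing with MPI_INIT?
- @MatthieuBrucher I only declare rank and size. My main function starts like this: int main(int argc, char *argv[]) { int rank, size; MPI_Init(&argc, &argv);. I #include several other .cu and .cpp files throughout the code, but I assume you only want what happens in main before the MPI_Init() call.
- @GillesGouaillardet See the edit.
- Could it be that nvcc is not linking against the correct .so? Try passing /home/user/.openmpi/lib/libmpi.so directly.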
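The last comment's suspicion can be checked before relinking: ldd shows which shared libraries a binary resolves at load time. The sketch below filters that output down to the Open MPI libraries (libmpi, libopen-rte, libopen-pal). resolved_libs is a hypothetical helper, and ./program is the binary name from the question.

```shell
#!/bin/sh
# resolved_libs: read `ldd` output on stdin and print "name path" for each
# MPI-related library (hypothetical helper for this sketch).
resolved_libs() {
    awk '$2 == "=>" && $1 ~ /^lib(mpi|open-rte|open-pal)/ {
        print $1, ($3 == "not" ? "not found" : $3)
    }'
}

BIN="${1:-./program}"   # binary name taken from the question
if [ -x "$BIN" ] && command -v ldd >/dev/null 2>&1; then
    ldd "$BIN" | resolved_libs
else
    echo "$BIN not found or ldd unavailable; nothing to inspect"
fi
```

If the reported libmpi path is not under /home/user/.openmpi/lib, the comment's suggestion amounts to replacing -L/home/user/.openmpi/lib/ -lmpi in the nvcc command with the full path /home/user/.openmpi/lib/libmpi.so.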
Tags: compiler-errors mpi openmpi nvcc nvidia-jetson