[Posted]: 2018-02-16 23:38:03
[Question]:
I get a segmentation fault when I run valgrind on a very simple MPI program:
#include "mpi.h"
#include <iostream>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    // Initialize parallel
    int rank, numProcess;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcess);
    std::cout << "Hello world, - Rank " << rank << "\n";
    MPI_Finalize();
    return 0;
}
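Running mpirun -np 2 ./mpi_test works fine. However, mpirun -np 2 valgrind ./mpi_test returns a long list of errors and never prints Hello world. I know valgrind can report false positives with MPI, but here it will not even run a simple hello world program. These are the errors I get: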
==85595== Memcheck, a memory error detector
==85595== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==85595== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==85595== Command: ./mpi_test
==85595==
==85596== Memcheck, a memory error detector
==85596== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==85596== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==85596== Command: ./mpi_test
==85596==
==85596== Syscall param msg->desc.port.name points to uninitialised byte(s)
==85596== at 0x10070B34A: mach_msg_trap (in /usr/lib/system/libsystem_kernel.dylib)
==85596== by 0x10070A796: mach_msg (in /usr/lib/system/libsystem_kernel.dylib)
==85596== by 0x100704485: task_set_special_port (in /usr/lib/system/libsystem_kernel.dylib)
==85596== by 0x1008A010E: _os_trace_create_debug_control_port (in /usr/lib/system/libsystem_trace.dylib)
==85596== by 0x1008A0458: _libtrace_init (in /usr/lib/system/libsystem_trace.dylib)
==85596== by 0x10026B9DF: libSystem_initializer (in /usr/lib/libSystem.B.dylib)
==85596== by 0x10001AA1A: ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) (in /usr/lib/dyld)
==85596== by 0x10001AC1D: ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) (in /usr/lib/dyld)
==85596== by 0x1000164A9: ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) (in /usr/lib/dyld)
==85596== by 0x100016440: ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) (in /usr/lib/dyld)
==85596== by 0x100015523: ImageLoader::processInitializers(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) (in /usr/lib/dyld)
==85596== by 0x1000155B8: ImageLoader::runInitializers(ImageLoader::LinkContext const&, ImageLoader::InitializerTimingList&) (in /usr/lib/dyld)
==85596== Address 0x10488d25c is on thread 1's stack
==85596== in frame #2, created by task_set_special_port (???:)
==85596==
==85595== Syscall param msg->desc.port.name points to uninitialised byte(s)
==85595== at 0x10070B34A: mach_msg_trap (in /usr/lib/system/libsystem_kernel.dylib)
==85595== by 0x10070A796: mach_msg (in /usr/lib/system/libsystem_kernel.dylib)
==85595== by 0x100704485: task_set_special_port (in /usr/lib/system/libsystem_kernel.dylib)
==85595== by 0x1008A010E: _os_trace_create_debug_control_port (in /usr/lib/system/libsystem_trace.dylib)
==85595== by 0x1008A0458: _libtrace_init (in /usr/lib/system/libsystem_trace.dylib)
==85595== by 0x10026B9DF: libSystem_initializer (in /usr/lib/libSystem.B.dylib)
==85595== by 0x10001AA1A: ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) (in /usr/lib/dyld)
==85595== by 0x10001AC1D: ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) (in /usr/lib/dyld)
==85595== by 0x1000164A9: ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) (in /usr/lib/dyld)
==85595== by 0x100016440: ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) (in /usr/lib/dyld)
==85595== by 0x100015523: ImageLoader::processInitializers(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) (in /usr/lib/dyld)
==85595== by 0x1000155B8: ImageLoader::runInitializers(ImageLoader::LinkContext const&, ImageLoader::InitializerTimingList&) (in /usr/lib/dyld)
==85595== Address 0x10488d25c is on thread 1's stack
==85595== in frame #2, created by task_set_special_port (???:)
==85595==
--85595-- UNKNOWN task message [id 3445, to mach_task_self(), reply 0x707]
--85596-- UNKNOWN task message [id 3445, to mach_task_self(), reply 0x707]
--85595-- UNKNOWN task message [id 3445, to mach_task_self(), reply 0x707] (repeated 2 times)
--85596-- UNKNOWN task message [id 3445, to mach_task_self(), reply 0x707] (repeated 2 times)
--85596-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option
--85595-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option
--85596-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 2 times)
--85595-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 2 times)
--85596-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 4 times)
--85595-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 4 times)
--85595-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 8 times)
--85596-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 8 times)
==85595== Thread 2:
==85595== Invalid read of size 4
==85595== at 0x100868899: _pthread_body (in /usr/lib/system/libsystem_pthread.dylib)
==85595== by 0x100868886: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==85595== by 0x10086808C: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==85595== Address 0x18 is not stack'd, malloc'd or (recently) free'd
==85595==
==85595== Invalid read of size 8
==85595== at 0x100866435: _pthread_mutex_lock_slow (in /usr/lib/system/libsystem_pthread.dylib)
==85595== by 0x100560117: dyldGlobalLockAcquire() (in /usr/lib/system/libdyld.dylib)
==85595== by 0x100021F95: ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)()) (in /usr/lib/dyld)
==85595== by 0x10000986C: dyld::fastBindLazySymbol(ImageLoader**, unsigned long) (in /usr/lib/dyld)
==85595== by 0x100560281: dyld_stub_binder (in /usr/lib/system/libdyld.dylib)
==85595== by 0x100382977: ??? (in /usr/local/Cellar/open-mpi/3.0.0_2/lib/libopen-pal.40.dylib)
==85595== by 0x25805BBB1: ???
==85595== by 0x100868886: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==85595== by 0x10086808C: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==85595== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==85595==
==85595==
==85595== Process terminating with default action of signal 11 (SIGSEGV)
==85595== Access not within mapped region at address 0x0
==85595== at 0x100866435: _pthread_mutex_lock_slow (in /usr/lib/system/libsystem_pthread.dylib)
==85595== by 0x100560117: dyldGlobalLockAcquire() (in /usr/lib/system/libdyld.dylib)
==85595== by 0x100021F95: ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)()) (in /usr/lib/dyld)
==85595== by 0x10000986C: dyld::fastBindLazySymbol(ImageLoader**, unsigned long) (in /usr/lib/dyld)
==85595== by 0x100560281: dyld_stub_binder (in /usr/lib/system/libdyld.dylib)
==85595== by 0x100382977: ??? (in /usr/local/Cellar/open-mpi/3.0.0_2/lib/libopen-pal.40.dylib)
==85595== by 0x25805BBB1: ???
==85595== by 0x100868886: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==85595== by 0x10086808C: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==85595== If you believe this happened as a result of a stack
==85595== overflow in your program's main thread (unlikely but
==85595== possible), you can try to increase the size of the
==85595== main thread stack using the --main-stacksize= flag.
==85595== The main thread stack size used in this run was 8388608.
--85595:0:schedule VG_(sema_down): read returned -4
==85595==
==85595== HEAP SUMMARY:
==85595== in use at exit: 358,368 bytes in 3,295 blocks
==85595== total heap usage: 5,625 allocs, 2,330 frees, 721,547 bytes allocated
==85595==
==85596== Thread 2:
==85596== Invalid read of size 4
==85596== at 0x100868899: _pthread_body (in /usr/lib/system/libsystem_pthread.dylib)
==85596== by 0x100868886: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==85596== by 0x10086808C: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==85596== Address 0x18 is not stack'd, malloc'd or (recently) free'd
==85596==
==85596== Invalid read of size 8
==85596== at 0x100866435: _pthread_mutex_lock_slow (in /usr/lib/system/libsystem_pthread.dylib)
==85596== by 0x100560117: dyldGlobalLockAcquire() (in /usr/lib/system/libdyld.dylib)
==85596== by 0x100021F95: ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)()) (in /usr/lib/dyld)
==85596== by 0x10000986C: dyld::fastBindLazySymbol(ImageLoader**, unsigned long) (in /usr/lib/dyld)
==85596== by 0x100560281: dyld_stub_binder (in /usr/lib/system/libdyld.dylib)
==85596== by 0x100382977: ??? (in /usr/local/Cellar/open-mpi/3.0.0_2/lib/libopen-pal.40.dylib)
==85596== by 0x25805BBB1: ???
==85596== by 0x100868886: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==85596== by 0x10086808C: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==85596== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==85596==
==85596==
==85596== Process terminating with default action of signal 11 (SIGSEGV)
==85596== Access not within mapped region at address 0x0
==85596== at 0x100866435: _pthread_mutex_lock_slow (in /usr/lib/system/libsystem_pthread.dylib)
==85596== by 0x100560117: dyldGlobalLockAcquire() (in /usr/lib/system/libdyld.dylib)
==85596== by 0x100021F95: ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)()) (in /usr/lib/dyld)
==85596== by 0x10000986C: dyld::fastBindLazySymbol(ImageLoader**, unsigned long) (in /usr/lib/dyld)
==85596== by 0x100560281: dyld_stub_binder (in /usr/lib/system/libdyld.dylib)
==85596== by 0x100382977: ??? (in /usr/local/Cellar/open-mpi/3.0.0_2/lib/libopen-pal.40.dylib)
==85596== by 0x25805BBB1: ???
==85596== by 0x100868886: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==85596== by 0x10086808C: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==85596== If you believe this happened as a result of a stack
==85596== overflow in your program's main thread (unlikely but
==85596== possible), you can try to increase the size of the
==85596== main thread stack using the --main-stacksize= flag.
==85596== The main thread stack size used in this run was 8388608.
--85596:0:schedule VG_(sema_down): read returned -4
==85596==
==85596== HEAP SUMMARY:
==85596== in use at exit: 358,368 bytes in 3,295 blocks
==85596== total heap usage: 5,625 allocs, 2,330 frees, 721,547 bytes allocated
==85596==
==85595== LEAK SUMMARY:
==85595== definitely lost: 9,159 bytes in 47 blocks
==85595== indirectly lost: 8,112 bytes in 111 blocks
==85595== possibly lost: 0 bytes in 0 blocks
==85595== still reachable: 325,270 bytes in 2,982 blocks
==85595== suppressed: 15,827 bytes in 155 blocks
==85595== Rerun with --leak-check=full to see details of leaked memory
==85595==
==85595== For counts of detected and suppressed errors, rerun with: -v
==85595== Use --track-origins=yes to see where uninitialised values come from
==85595== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 1 from 1)
==85596== LEAK SUMMARY:
==85596== definitely lost: 3,839 bytes in 1 blocks
==85596== indirectly lost: 0 bytes in 0 blocks
==85596== possibly lost: 72 bytes in 3 blocks
==85596== still reachable: 336,638 bytes in 3,138 blocks
==85596== suppressed: 17,819 bytes in 153 blocks
==85596== Rerun with --leak-check=full to see details of leaked memory
==85596==
==85596== For counts of detected and suppressed errors, rerun with: -v
==85596== Use --track-origins=yes to see where uninitialised values come from
==85596== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 1 from 1)
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node cu-vpn-colorado-edu-198 exited on signal 11 (Segmentation fault: 11).
--------------------------------------------------------------------------
[Discussion]:
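-
#include "/usr/local/include/mpi.h" Ouch, really?
-
To me this just looks like an incompatibility.
-
@LightnessRacesinOrbit I don't follow? The include path makes sure I am using the mpi.h that matches my MPI compiler. It is actually a pointer to /usr/local/Cellar, where brew installed it. As for compatibility, is valgrind only compatible with certain kinds of MPI compilers, or do you mean something else?
-
Writing an absolute path in an #include directive is very unidiomatic and shortens the shelf life of your program. If you need a non-standard include path, pass -I/usr/local/Cellar in your compile command and write just #include "mpi.h" as usual.
-
The MPI way is to #include <mpi.h> and then use the mpi{cc,cxx,fort} wrappers, so you do not have to worry about include and library paths, or about which libraries to link.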