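【Posted】: 2019-06-10 16:19:12
【Question】:
The aim of my program is to calculate the electrostatic potential between an inner conductor and an outer conductor by splitting the domain into a grid and then splitting the grid into slices. Each processor gets one slice and runs the calculation on it. I use MPI_Isend and MPI_Irecv to send data between processors. When I test the code I get a segmentation fault: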
[physnode5:81440] *** Process received signal ***
[physnode5:81440] Signal: Segmentation fault (11)
[physnode5:81440] Signal code: Address not mapped (1)
[physnode5:81440] Failing at address: 0x58
[physnode5:81440] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2ab8069df5d0]
[physnode5:81440] [ 1] /opt/yarcc/libraries/openmpi/2.1.0/1/default/lib/libmpi.so.20(ompi_request_default_wait+0xd)[0x2ab8066495ed]
[physnode5:81440] [ 2] /opt/yarcc/libraries/openmpi/2.1.0/1/default/lib/libmpi.so.20(MPI_Wait+0x5d)[0x2ab80667a00d]
[physnode5:81440] [ 3] ./mpi_tezt.exe[0x400ffc]
[physnode5:81440] [ 4] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab806c0e3d5]
[physnode5:81440] [ 5] ./mpi_tezt.exe[0x4009b9]
[physnode5:81440] *** End of error message ***
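This happens when the code below is executed. Please note that I am ssh'd into a cluster. The file name is mpi_tezt.exe (yes, I misspelled it). I have checked that the arrays I am sending are allocated correctly and that the send and recv calls do not send or receive data that does not exist (i.e. data outside the bounds of the array). My MPI_Isend and MPI_Irecv code is as follows: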
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
/*MPI Specific Variables*/
int my_size, my_rank, up, down;
MPI_Request reqU, reqD, sreqU, sreqD;
MPI_Status rUstatus, rDstatus, sUstatus, sDstatus;
/*Physical Dimensions*/
double Linner = 5.0;/*mm*/
double Rinner = 1.0;/*mm*/
double phi_0 = 1000.0;/*V*/
/*Other Variables*/
int grid_size = 100;
int slice;
int x,y;
double grid_res_y = 0.2;
double grid_res_x = 0.1;
int xboundary, yboundary;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &my_size);
/*Determining neighbours*/
if (my_rank != 0) /*if statements keep the neighbours of the highest and lowest ranks inside the 0 to my_size-1 range of ranks*/
{
up = my_rank-1;
}
else
{
up = 0;
}
if(my_rank != my_size-1)
{
down = my_rank+1;
}
else
{
down = my_size-1;
}
/*cross-check: presumed my_size is a factor of gridsize else there are odd sized slices and this is not coded for*/
if (grid_size%my_size != 0)
{
printf("ERROR - number of procs = %d, this is not a factor of grid_size %d\n", my_size, grid_size);
exit(0);
}
/*Set Up Distributed Data Approach*/
slice = grid_size/my_size;
yboundary = Linner/grid_res_y; /*y grid index of inner conductor wall*/
xboundary = Rinner/grid_res_x; /*x grid and individual array index of inner conductor wall*/
double phi[slice+2][grid_size]; /*extra 2 rows to allow for halo data*/
for (y=0; y < slice+2; y++)
{
for (x=0; x < grid_size; x++)
{
phi[y][x] = 0.0;
}
}
if(my_rank == 0) /*Boundary Containing rank does 2 loops. One over part with inner conductor and one over part without inner conductor*/
{
for(y=0; y < slice+1; y++)
{
for(x=xboundary; x < grid_size; x++)
{
phi[y][x] = phi_0;
}
}
}
if (my_rank < my_size-1)
{
/*send top most strip up one node to be received as bottom halo*/
MPI_Isend(&phi[1][0], grid_size , MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &sreqU);
/*recv top halo from up one node*/
MPI_Irecv(&phi[slice+1][0], grid_size, MPI_DOUBLE, down, 2, MPI_COMM_WORLD, &reqU);
}
if (my_rank > 0)
{
/*recv top halo from down one node*/
MPI_Irecv(&phi[0][0], grid_size , MPI_DOUBLE, up, 2, MPI_COMM_WORLD, &reqD);
/*send bottom most strip down one node to be received as top halo*/
MPI_Isend(&phi[slice][0], grid_size , MPI_DOUBLE, up, 1, MPI_COMM_WORLD, &sreqD);
}
if (my_rank<my_size-1)
{
/*Wait for send to down one rank to complete*/
MPI_Wait(&sreqD, &sDstatus);
/*Wait for receive from up one rank to complete*/
MPI_Wait(&reqD, &rDstatus);
}
if (my_rank>0)
{
/*Wait for send to up down one rank to complete*/
MPI_Wait(&sreqU, &sUstatus);
/*Wait for receive from down one rank to complete*/
MPI_Wait(&reqU, &rUstatus);
}
MPI_Finalize();
return 0;
}
I have been testing with 2 processors (ranks 0 and 1) and hope to scale up to more.
Any ideas where the problem might be?
【Comments】:
-
Wow, that is a lot of code.
-
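Have you tried running it under a debugger? It will usually take you to the spot where the segmentation fault occurs and give you a full snapshot of the program state at that moment. If you cannot do that for some reason, you can set ulimit -c to allow core dumps and debug the core file instead.
-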
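There are too many places where this could go wrong. But have you checked that slice and grid_size are sensible? Even after that check there are many array references that could go astray. These lines: if (my_rank = A_proc) will not do what you want them to do.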
-
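Note the word minimal in minimal reproducible example. Try to focus the code you post on the specific problem you are asking about, even if you have to cut it down somewhat to illustrate it.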
-
@brothir I added a print statement before each wait statement. The one at the wait on sreqD was never printed. Could the problem be in MPI_Wait?