mpi4py performance discrepancy after profiling: answers

【Question title】: mpi4py performance discrepancy after profiling
【Posted】: 2019-05-01 13:31:09
【Question】:

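I have been doing some work with arrays in mpi4py, and I recently noticed a performance improvement when using the Scatterv() function. I wrote some code that checks the data type of the input object: if it is a numeric numpy array, it scatters it with Scatterv(); otherwise it falls back to the generic comm.scatter() function.

The code looks like this: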

import numpy as np
from mpi4py import MPI
import cProfile
from line_profiler import LineProfiler

def ScatterV(object, comm, root = 0):
    optimize_scatter, object_type = np.zeros(1), None

    if rank == root:
        if isinstance(object, np.ndarray):
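            # np.float and np.int were aliases of the Python builtins and were
            # removed in NumPy 1.24; the sized dtypes below already cover them.
            if object.dtype in [np.float64, np.float32, np.float16,
                                np.int8, np.int16, np.int32, np.int64]: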
                optimize_scatter = 1
                object_type = object.dtype

            else: optimize_scatter, object_type = 0, None
        else: optimize_scatter, object_type = 0, None

        optimize_scatter = np.array(optimize_scatter, dtype=np.float64).ravel()

    comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
    object_type = comm.bcast(object_type, root=root)

    if int(optimize_scatter) == 1:

        if rank == root:

            displs = [int(i)*object.shape[1] for i in
                          np.linspace(0, object.shape[0], comm.size + 1)]
            counts = [displs[i+1] - displs[i] for i in range(len(displs)-1)]
            lens = [int((displs[i+1] - displs[i])/(object.shape[1]))
                        for i in range(len(displs)-1)]
            displs = displs[:-1]
            shape = object.shape

            object = object.ravel().astype(np.float64, copy=False)

        else:
            object, counts, displs, shape, lens = None, None, None, None, None

        counts = comm.bcast(counts, root=root)
        displs = comm.bcast(displs, root=root)
        lens = comm.bcast(lens, root=root)
        shape = list(comm.bcast(shape, root=root))

        shape[0] = lens[rank]
        shape = tuple(shape)

        x = np.zeros(counts[rank])

        comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)


        return  np.reshape(x, (-1,) + shape[1:]).astype(object_type, copy=False)

    else:
        return comm.scatter(object, root=root)


comm = MPI.COMM_WORLD
size, rank = comm.Get_size(), comm.Get_rank()



if rank == 0:
    arra = (np.random.rand(10000000, 10) * 100).astype(np.float64, copy=False)
else: arra = None

lp = LineProfiler()

lp_wrapper = lp(ScatterV)
lp_wrapper(arra, comm)

if rank == 4: lp.print_stats()


pr = cProfile.Profile()
pr.enable()

f2 = ScatterV(arra, comm)

pr.disable()

if rank == 4: pr.print_stats()
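
The row-splitting arithmetic inside ScatterV can be checked without MPI. The following is a minimal sketch, not the code from the question: the helper name split_rows is hypothetical, and it uses integer division where the original uses np.linspace followed by int(), which should give the same boundaries for non-negative sizes.

```python
def split_rows(n_rows, n_cols, n_ranks):
    """Return (counts, displs, lens) for scattering a flattened 2-D array.

    counts[i]: number of elements sent to rank i
    displs[i]: offset of rank i's block in the flattened array
    lens[i]:   number of rows received by rank i
    """
    # Row boundaries: rank i gets rows bounds[i] .. bounds[i + 1] - 1.
    bounds = [(i * n_rows) // n_ranks for i in range(n_ranks + 1)]
    lens = [bounds[i + 1] - bounds[i] for i in range(n_ranks)]
    counts = [n * n_cols for n in lens]
    displs = [b * n_cols for b in bounds[:-1]]
    return counts, displs, lens


counts, displs, lens = split_rows(n_rows=10, n_cols=3, n_ranks=4)
print(counts)  # [6, 9, 6, 9]
print(displs)  # [0, 6, 15, 21]
print(lens)    # [2, 3, 2, 3]
```

The counts always sum to n_rows * n_cols, so every element of the flattened array is sent exactly once.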

Profiling with LineProfiler produces the following results [only the conflicting lines are shown]:

Timer unit: 1e-06 s

Total time: 2.05001 s
File: /media/SETH_DATA/SETH_Alex/BigMPI4py/prueba.py
Function: ScatterV at line 26

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   ...                                          
    41         1    1708453.0 1708453.0     83.3      comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
    42         1        148.0    148.0      0.0      object_type = comm.bcast(object_type, root=root)
   ...                                
    76         1        264.0    264.0      0.0          counts = comm.bcast(counts, root=root)
    77         1         16.0     16.0      0.0          displs = comm.bcast(displs, root=root)
    78         1         14.0     14.0      0.0          lens = comm.bcast(lens, root=root)
    79         1          9.0      9.0      0.0          shape = list(comm.bcast(shape, root=root))
 ...                                
    86         1     340971.0 340971.0     16.6          comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)

Profiling with cProfile produces the following results:

         17 function calls in 0.462 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.127    0.127    0.127    0.127 {method 'Bcast' of 'mpi4py.MPI.Comm' objects}
        1    0.335    0.335    0.335    0.335 {method 'Scatterv' of 'mpi4py.MPI.Comm' objects}
        5    0.000    0.000    0.000    0.000 {method 'bcast' of 'mpi4py.MPI.Comm' objects}

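In both cases the Bcast call takes a disproportionate amount of time compared with the Scatterv call. What is more, with LineProfiler the Bcast call is 5 times slower than the Scatterv call, which seems completely incoherent to me, since Bcast only broadcasts a single-element array.

If I swap lines 41 and 42, the results are as follows: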

LineProfiler

41         1    1666718.0 1666718.0     83.0      object_type = comm.bcast(object_type, root=root)
42         1         47.0     47.0      0.0      comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
87         1     341728.0 341728.0     17.0          comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)

cProfile

1    0.000    0.000    0.000    0.000 {method 'Bcast' of 'mpi4py.MPI.Comm' objects}
1    0.339    0.339    0.339    0.339 {method 'Scatterv' of 'mpi4py.MPI.Comm' objects}
5    0.129    0.026    0.129    0.026 {method 'bcast' of 'mpi4py.MPI.Comm' objects}

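If I change the size of the array to be scattered, the time consumed by Scatterv and Bcast scales by the same factor. For example, if I increase the size tenfold (100000000), the results are: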

LineProfiler

41         1   16304301.0 16304301.0     82.8      comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
42         1        235.0    235.0      0.0      object_type = comm.bcast(object_type, root=root)
87         1    3393658.0 3393658.0     17.2          comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)

cProfile

    1    1.348    1.348    1.348    1.348 {method 'Bcast' of 'mpi4py.MPI.Comm' objects}
    1    4.517    4.517    4.517    4.517 {method 'Scatterv' of 'mpi4py.MPI.Comm' objects}
    5    0.000    0.000    0.000    0.000 {method 'bcast' of 'mpi4py.MPI.Comm' objects}

If I print the results from any other rank > 1 instead of rank 4, the same thing happens. For rank = 0, however, the results are different:

LineProfiler

41         1        186.0    186.0      0.0      comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
42         1        244.0    244.0      0.0      object_type = comm.bcast(object_type, root=root)
87         1    4722349.0 4722349.0    100.0          comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)

cProfile

    1    0.000    0.000    0.000    0.000 {method 'Bcast' of 'mpi4py.MPI.Comm' objects}
    1    5.921    5.921    5.921    5.921 {method 'Scatterv' of 'mpi4py.MPI.Comm' objects}
    5    0.000    0.000    0.000    0.000 {method 'bcast' of 'mpi4py.MPI.Comm' objects}

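In this case, the Bcast call takes about as long as the other bcast calls.

I have also tried using bcast and scatter instead of Bcast at line 41, which yields the same results.

Given this, I think the extra time is being wrongly attributed to the first broadcast, which would mean that both profilers produce misleading timings for parallelized code.

I am fairly sure that the profilers' internals are not designed for parallelized functions, but I am posting this question to find out whether anyone has run into similar results.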

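【Question comments】:

Keep in mind that the profiler reports the time spent sending/receiving the message plus the time the non-root ranks spend waiting for it (synchronization, so to speak). If there is a lot of imbalance, some tasks may spend a long time waiting in the broadcast. They are then synchronized for the scatterv, so the profiler reports it as faster. For timing purposes, you can add MPI_Barrier() before the broadcast. My guess is that most of the time will be spent in the barrier (no, the barrier is not slow; you are mostly measuring the imbalance), and that scatterv() will turn out to be slower than bcast().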
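
The effect described in this comment can be reproduced without MPI. The sketch below is only an analogy: threads stand in for ranks, threading.Event objects stand in for the collectives, and the names simulate_late_root, root_delay and n_workers are made up for illustration. The workers block in the first call until the late "root" shows up (in the question, plausibly rank 0 still building its array), so a per-line profiler running on a worker would charge that wait to the first call.

```python
import threading
import time


def simulate_late_root(root_delay=0.3, n_workers=3):
    """Return {worker: (seconds in first call, seconds in second call)}."""
    first = threading.Event()   # stands in for the first broadcast
    second = threading.Event()  # stands in for a later collective
    timings = {}

    def worker(rank):
        t0 = time.perf_counter()
        first.wait()            # blocks until the "root" arrives
        t1 = time.perf_counter()
        second.wait()           # the root is already here, so this is quick
        t2 = time.perf_counter()
        timings[rank] = (t1 - t0, t2 - t1)

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(1, n_workers + 1)]
    for t in threads:
        t.start()

    time.sleep(root_delay)      # the "root" is busy with unrelated work
    first.set()
    second.set()

    for t in threads:
        t.join()
    return timings


for rank, (first_s, second_s) in sorted(simulate_late_root().items()):
    print(f"worker {rank}: first call {first_s:.3f} s, second call {second_s:.6f} s")
```

Nearly all of the measured time lands on the first blocking call, even though that call itself transfers nothing.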

Tags: python-3.x profiling mpi mpi4py cprofile

【Solution 1】:

In response to Gilles Gouaillardet, I added comm.Barrier() on the lines before and after each bcast call, and most of the time ends up concentrated in those comm.Barrier() calls.

Here is an example with LineProfiler.

Timer unit: 1e-06 s

Total time: 2.17248 s
File: /media/SETH_DATA/SETH_Alex/BigMPI4py/prueba.py
Function: ScatterV at line 26

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    26                                           def ScatterV(object, comm, root = 0):
    27         1          7.0      7.0      0.0      optimize_scatter, object_type = np.zeros(1), None
    28                                           
    29         1          2.0      2.0      0.0      if rank == root:
    30                                                   if isinstance(object, np.ndarray):
    31                                                       if object.dtype in [np.float64, np.float32, np.float16, np.float,
    32                                                                           np.int, np.int8, np.int16, np.int32, np.int64]:
    33                                                           optimize_scatter = 1
    34                                                           object_type = object.dtype
    35                                           
    36                                                       else: optimize_scatter, object_type = 0, None
    37                                                   else: optimize_scatter, object_type = 0, None
    38                                           
    39                                                   optimize_scatter = np.array(optimize_scatter, dtype=np.float64).ravel()
    40                                           
    41         1    1677662.0 1677662.0     77.2      comm.Barrier()
    42         1         76.0     76.0      0.0      comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
    43         1        345.0    345.0      0.0      comm.Barrier()
    44         1        111.0    111.0      0.0      object_type = comm.bcast(object_type, root=root)
    45         1        166.0    166.0      0.0      comm.Barrier()
    46                                           
    47                                           
    48                                           
    49         1          7.0      7.0      0.0      if int(optimize_scatter) == 1:
    50                                           
    51         1          2.0      2.0      0.0          if rank == root:
    52                                                       if object.ndim > 1:
    53                                                           displs = [int(i)*object.shape[1] for i in
    54                                                                     np.linspace(0, object.shape[0], comm.size + 1)]
    55                                                       else:
    56                                                           displs = [int(i) for i in np.linspace(0, object.shape[0], comm.size + 1)]
    57                                           
    58                                                       counts = [displs[i+1] - displs[i] for i in range(len(displs)-1)]
    59                                           
    60                                                       if object.ndim > 1:
    61                                                           lens = [int((displs[i+1] - displs[i])/(object.shape[1]))
    62                                                                   for i in range(len(displs)-1)]
    63                                                       else:
    64                                                           lens = [displs[i+1] - displs[i] for i in range(len(displs)-1)]
    65                                           
    66                                                       displs = displs[:-1]
    67                                           
    68                                           
    69                                                       shape = object.shape
    70                                           
    71                                           
    72                                           
    73                                                       if object.ndim > 1:
    74                                                           object = object.ravel().astype(np.float64, copy=False)
    75                                           
    76                                           
    77                                                   else:
    78         1          2.0      2.0      0.0              object, counts, displs, shape, lens = None, None, None, None, None
    79                                           
    80         1        295.0    295.0      0.0          counts = comm.bcast(counts, root=root)
    81         1         66.0     66.0      0.0          displs = comm.bcast(displs, root=root)
    82         1          6.0      6.0      0.0          lens = comm.bcast(lens, root=root)
    83         1          9.0      9.0      0.0          shape = list(comm.bcast(shape, root=root))
    84                                           
    85         1          2.0      2.0      0.0          shape[0] = lens[rank]
    86         1          3.0      3.0      0.0          shape = tuple(shape)
    87                                           
    88         1         33.0     33.0      0.0          x = np.zeros(counts[rank])
    89                                           
    90         1         76.0     76.0      0.0          comm.Barrier()
    91         1     351187.0 351187.0     16.2          comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)
    92         1     142352.0 142352.0      6.6          comm.Barrier()
    93                                           
    94         1          5.0      5.0      0.0          if len(shape) > 1:
    95         1         66.0     66.0      0.0              return  np.reshape(x, (-1,) + shape[1:]).astype(object_type, copy=False)
    96                                                   else:
    97                                                       return x.view(object_type)
    98                                           
    99                                           
   100                                               else:
   101                                                   return comm.scatter(object, root=root)

77.2% of the time is spent in the first comm.Barrier() call, so I can safely assume that none of the bcast calls takes that much time. I will consider adding comm.Barrier() calls for future profiling.
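
For future profiling runs, the barrier-then-time pattern could be wrapped in a small helper. This is a minimal sketch under stated assumptions: timed_after_barrier and SleepyComm are hypothetical names, the helper only requires that comm has a Barrier() method (as mpi4py communicators do), and SleepyComm is a stand-in whose Barrier() simply sleeps so that the example runs without MPI.

```python
import time


def timed_after_barrier(comm, collective):
    """Run collective() after comm.Barrier(); return (result, wait_s, call_s)."""
    t0 = time.perf_counter()
    comm.Barrier()          # absorbs the time spent waiting for late ranks
    t1 = time.perf_counter()
    result = collective()   # now measures mostly the communication itself
    t2 = time.perf_counter()
    return result, t1 - t0, t2 - t1


class SleepyComm:
    """Hypothetical stand-in for a communicator; Barrier() just sleeps."""

    def __init__(self, wait_s):
        self.wait_s = wait_s

    def Barrier(self):
        time.sleep(self.wait_s)


result, wait_s, call_s = timed_after_barrier(SleepyComm(0.2), lambda: "payload")
print(f"wait {wait_s:.3f} s, call {call_s:.6f} s")
```

With mpi4py, comm would be MPI.COMM_WORLD and collective could be, for example, lambda: comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root). The wait and call times are reported separately, so load imbalance is not mistaken for a slow broadcast.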

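【Comments】:

Keep in mind that adding a barrier() changes the behavior of your application (in the worst case it causes a deadlock), so you should at least double-check that the elapsed time is roughly the same with and without the barrier and the profiler.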
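
The deadlock risk comes from a barrier that not every rank reaches, for example one placed inside a branch that only some ranks take. The sketch below illustrates this with Python threads instead of MPI ranks, which is an assumption made purely so that it runs anywhere: ranks_reaching_barrier is a hypothetical helper, and threading.Barrier is given a timeout so the example terminates, whereas a real MPI barrier would simply hang.

```python
import threading


def ranks_reaching_barrier(expected, arriving, timeout=0.5):
    """Let `arriving` threads wait on a barrier sized for `expected` parties."""
    barrier = threading.Barrier(expected)
    outcomes = []

    def rank_body():
        try:
            barrier.wait(timeout=timeout)
            outcomes.append("passed")
        except threading.BrokenBarrierError:
            # Raised when the timeout expires before all parties arrive.
            outcomes.append("stuck")

    threads = [threading.Thread(target=rank_body) for _ in range(arriving)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes


print(ranks_reaching_barrier(expected=3, arriving=3))  # ['passed', 'passed', 'passed']
print(ranks_reaching_barrier(expected=3, arriving=2))  # ['stuck', 'stuck']
```

When one party never arrives, every other party stays blocked, which is why barriers added for profiling need to sit on a code path that all ranks execute.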