并行数字积分器功能比顺序版本慢。为什么？答案

【问题标题】：Parallel numeric integrator function slower than sequential version. Why?并行数字积分器功能比顺序版本慢。为什么？
【发布时间】：2018-10-05 12:12:33
【问题描述】：

我正在学习 Haskell 中的并行策略。

问题

我写了基于梯形规则（顺序）的积分函数：

integrateT :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a 
integrateT f (ini, fin) dx 
  = let lst = map f [ini,ini+dx..fin]
    in sum lst * dx - 0.5 * (f ini + f fin) * dx

我按如下方式运行它：

main = do
  print $ (integrateT (\x -> x^4 - x^3 + x^2 + x/13 + 1) (0.0,1000000.0) 0.01 :: Double)

以下是运行的统计数据：

stack exec lab5 -- +RTS -ls -N2 -s
1.9999974872991426e29
  18,400,147,552 bytes allocated in the heap
      20,698,168 bytes copied during GC
          66,688 bytes maximum residency (2 sample(s))
          35,712 bytes maximum slop
               3 MB total memory in use (0 MB lost due to fragmentation)

                                 Tot time (elapsed)  Avg pause  Max pause
  Gen  0     17754 colls, 17754 par    0.123s   0.105s     0.0000s    0.0011s
  Gen  1         2 colls,     1 par    0.000s   0.000s     0.0001s    0.0002s

  Parallel GC work balance: 0.27% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    6.054s  (  5.947s elapsed)
  GC      time    0.123s  (  0.106s elapsed)
  EXIT    time    0.001s  (  0.008s elapsed)
  Total   time    6.178s  (  6.061s elapsed)

  Alloc rate    3,039,470,269 bytes per MUT second

  Productivity  98.0% of total user, 98.2% of total elapsed

gc_alloc_block_sync: 77
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

如您所见，它运行得非常快。但是，当我尝试像这样使其并行时：

integrateT :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a 
integrateT f (ini, fin) dx 
  = let lst = (map f [ini,ini+dx..fin]) `using` parListChunk 100 rdeepseq
    in sum lst * dx - 0.5 * (f ini + f fin) * dx

它总是运行更长的时间。以下是上面运行的示例统计信息：

stack exec lab5 -- +RTS -ls -N2 -s
1.9999974872991426e29
  59,103,320,488 bytes allocated in the heap
  17,214,458,128 bytes copied during GC
  2,787,092,160 bytes maximum residency (15 sample(s))
  43,219,264 bytes maximum slop
        5570 MB total memory in use (0 MB lost due to fragmentation)

                                 Tot time (elapsed)  Avg pause  Max pause
  Gen  0     44504 colls, 44504 par   16.907s  10.804s     0.0002s    0.0014s
  Gen  1        15 colls,    14 par    4.006s   2.991s     0.1994s    1.2954s

  Parallel GC work balance: 33.60% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)

  SPARKS: 1000001 (1000001 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   14.298s  ( 12.392s elapsed)
  GC      time   20.912s  ( 13.795s elapsed)
  EXIT    time    0.000s  (  0.003s elapsed)
  Total   time   35.211s  ( 26.190s elapsed)

  Alloc rate    4,133,806,996 bytes per MUT second

  Productivity  40.6% of total user, 47.3% of total elapsed

gc_alloc_block_sync: 2304055
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 1105370

我在这里看到的几件事：

使用的内存更多
更长的时间
大量时间花在 GC 上

我还做了什么

我做了一些实验：

使用 parMap、parList 和自定义 parListChunk' 函数来评估列表 - 每次结果都比顺序版本差得多
使用不同的块大小 - 从非常小的 5 到列表长度的一半 - 结果每次都比顺序版本差得多
将主函数中的因子更改为非常大的值，例如 x^123442，例如添加了更多的除法而不是乘法等。而且我还减少了问题的范围。所有这些都是为了减少火花，但在计算上更昂贵。在这里，我得到了类似于顺序版本的结果（使用这些新功能运行大约 28 秒） - 并行运行在 31 秒内完成
使用 Threadscope 对每次运行进行测试，以确保按预期使用了两个内核 - 确实如此！

问题

随着每个块的计算成本增加（例如 x^12345）和块数量的减少，并行性能得到改善 - 是不是在因素非常小时（例如 x^4、x^3 - 计算速度快），因此顺序版本更快？有没有办法以更好的性能成功并行化它？
为什么并行版本使用这么多内存和 GC 时间？
如何减少并行版本的 GC 时间？

【问题讨论】：

我认为顺序的已优化为在恒定内存中运行：“66,688 字节最大驻留”。相反，并行使用> 2GB。我不知道为什么。

标签： haskell

【解决方案1】：

parListChunk 这样的策略只有在到目前为止大部分计算成本都用于评估单个列表元素时才有意义，而首先构建列表脊椎等的开销。然而，使用integrateT 在一个简单的多项式上，这些元素的计算成本非常低，大部分成本将在列表开销中。它在顺序版本中仍然有效运行的唯一原因是 GHC 可以内联/融合大部分业务，但显然不是在并行版本中。

解决方案：让 每个线程 运行一个适当的可融合顺序版本，即划分 interval 而不是其离散列表形式。喜欢

integrateT :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a 
integrateT f (ini, fin) dx 
  = let lst = map f [ini,ini+dx..fin]
    in sum lst * dx - 0.5 * (f ini + f fin) * dx

integrateT_par :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a
integrateT_par f (l,r) dx
  = let chunks = [ integrateT f (l + i*wChunk, l + (i+1)*wChunk) dx
                 | i<-[0..nChunks-1] ]
               `using` parList rdeepseq
    in sum chunks
 where nChunks = 100
       wChunk = (r-l)/nChunks

这样的内存或性能不会比顺序版本差很多。

请注意，当我现在对其进行测试时，它实际上似乎根本没有并行运行，但很确定我只是错过了设置中的某些内容...

【讨论】：

我测试了你的版本 - 它显着加快了执行速度。同样在 ThreadScope 中检查它发现执行发生在 2 个内核上，使用它们非常有效。谢谢！
我仍然想知道在我的实现中 GCd 到底是什么。
@GrayCat 我认为它只是有一个很大的列表。如果你折腾一些{-# NOINLINE #-}，那么顺序版本也会变得非常低效。