Lazy IO + Parallelism: converting an image to grayscale

【Question title】: Lazy IO + Parallelism: converting an image to grayscale
【Posted】: 2014-07-03 04:48:13
【Question】:

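I am trying to add parallelism to a program that converts a .bmp to a grayscale .bmp. The parallel code usually performs 2-4x worse than the serial code. I have been tweaking the parBuffer and chunk sizes, but I still cannot reason about the results. Looking for guidance.

The entire source file used here: http://lpaste.net/106832

We use Codec.BMP to read in a stream of pixels represented by type RGBA = (Word8, Word8, Word8, Word8). To convert to grayscale, simply map a "luma" transform across all the pixels.

The serial implementation is literally: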

toGray :: [RGBA] -> [RGBA]
toGray x = map luma x
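The luma function itself is only in the linked paste, not in this post. As a minimal sketch of what such a transform might look like, the block below assumes an integer approximation of the Rec. 601 weights (0.299, 0.587, 0.114); the weights and the decision to keep the alpha channel unchanged are assumptions, not the original code.

```haskell
import Data.Word (Word8)

-- Pixel type as given in the question.
type RGBA = (Word8, Word8, Word8, Word8)

-- Hypothetical luma transform (an assumption, not the code from the paste):
-- integer approximation of the Rec. 601 weights 0.299, 0.587, 0.114.
-- The alpha channel is passed through unchanged.
luma :: RGBA -> RGBA
luma (r, g, b, a) = (y, y, y, a)
  where
    -- Widen to Int first so the weighted sum cannot overflow Word8.
    y = fromIntegral ((299 * toInt r + 587 * toInt g + 114 * toInt b) `div` 1000)

    toInt :: Word8 -> Int
    toInt = fromIntegral

-- Serial conversion, same shape as in the question.
toGray :: [RGBA] -> [RGBA]
toGray = map luma
```

Because the weights sum to 1000, the result of the division never exceeds 255, so narrowing back to Word8 is safe.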

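The test input .bmp is 5184 x 3456 (71.7 MB).

The serial implementation runs in about 10 seconds, or roughly 550 ns/pixel. The Threadscope profile looks clean.

Why is this so fast? I suppose it has something to do with lazy ByteStrings (even though Codec.BMP uses strict ByteStrings; is there an implicit conversion occurring here?) and fusion.

Adding Parallelism

My first attempt at adding parallelism was through parList. Oh boy. The program used about 4-5 GB of memory and the system started swapping.

I then read the "Parallelizing Lazy Streams with parBuffer" section of Simon Marlow's O'Reilly book and tried parBuffer with a large buffer size. This still did not produce desirable performance. The sparks were incredibly small.

I then tried to increase the spark size by chunking the lazy list, while sticking with parBuffer for the parallelism: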

toGrayPar :: [RGBA] -> [RGBA]
toGrayPar x = concat $ (withStrategy (parBuffer 500 rpar) . map (map luma))
                       (chunk 8000 x)

chunk :: Int -> [a] -> [[a]]
chunk _ [] = []
chunk n xs = as : chunk n bs where
  (as,bs) = splitAt n xs
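One detail of the code above that may be worth checking: rpar only sparks its argument, and a spark evaluates to weak head normal form, so for a chunk of type [RGBA] a spark can finish after producing the first cons cell. The sketch below keeps the same chunking and buffer sizes but uses rdeepseq so that each spark evaluates its whole chunk. The luma here is a placeholder (a plain channel average) and toGrayParDeep is a hypothetical name; whether this change accounts for the slowdown is not established in this thread.

```haskell
import Control.Parallel.Strategies (parBuffer, rdeepseq, withStrategy)
import Data.Word (Word8)

type RGBA = (Word8, Word8, Word8, Word8)

-- Placeholder transform (hypothetical): average the colour channels.
luma :: RGBA -> RGBA
luma (r, g, b, a) = (y, y, y, a)
  where
    y = fromIntegral ((fromIntegral r + fromIntegral g + fromIntegral b) `div` (3 :: Int))

-- Split a list lazily into pieces of at most n elements.
-- Assumes n > 0; with n <= 0 a non-empty input would never be consumed.
chunk :: Int -> [a] -> [[a]]
chunk _ [] = []
chunk n xs = as : chunk n bs
  where
    (as, bs) = splitAt n xs

-- Same shape as toGrayPar in the question, but each spark evaluates its
-- whole chunk to normal form (rdeepseq) instead of stopping at WHNF.
toGrayParDeep :: [RGBA] -> [RGBA]
toGrayParDeep =
  concat . withStrategy (parBuffer 500 rdeepseq) . map (map luma) . chunk 8000
```

This needs the parallel package and, to run on more than one core, compilation with -threaded and +RTS -N at run time. The buffer size of 500 and chunk size of 8000 are carried over from the question, not tuned values.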

But this still does not produce desirable performance:

  18,934,235,760 bytes allocated in the heap
  15,274,565,976 bytes copied during GC
     639,588,840 bytes maximum residency (27 sample(s))
     238,163,792 bytes maximum slop
            1910 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     35277 colls, 35277 par   19.62s   14.75s     0.0004s    0.0234s
  Gen  1        27 colls,    26 par   13.47s    7.40s     0.2741s    0.5764s

  Parallel GC work balance: 30.76% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)

  SPARKS: 4480 (2240 converted, 0 overflowed, 0 dud, 2 GC'd, 2238 fizzled)

  INIT    time    0.00s  (  0.01s elapsed)
  MUT     time   14.31s  ( 14.75s elapsed)
  GC      time   33.09s  ( 22.15s elapsed)
  EXIT    time    0.01s  (  0.12s elapsed)
  Total   time   47.41s  ( 37.02s elapsed)

  Alloc rate    1,323,504,434 bytes per MUT second

  Productivity  30.2% of total user, 38.7% of total elapsed

gc_alloc_block_sync: 7433188
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 1017408

How can I better reason about what is going on here?
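As a starting point, a few ratios can be read straight off the +RTS -s output. The sketch below only redoes that arithmetic with the figures quoted in the question; the names are illustrative. It shows that chunk 8000 yields 2240 chunks, that exactly two sparks were created per chunk, that about half of the sparks fizzled (a fizzled spark is one whose value had already been evaluated by the time it was due to run), and that GC accounts for roughly 70% of total CPU time. How to interpret these ratios is left open here; they describe symptoms rather than a confirmed cause.

```haskell
-- Figures copied from the question and its +RTS -s output.
pixels :: Int
pixels = 5184 * 3456

chunkSize :: Int
chunkSize = 8000

-- Number of chunks produced by `chunk 8000`, rounding up for the
-- final partial chunk. Evaluates to 2240.
chunks :: Int
chunks = (pixels + chunkSize - 1) `div` chunkSize

sparksCreated, sparksConverted, sparksFizzled :: Int
sparksCreated   = 4480
sparksConverted = 2240
sparksFizzled   = 2238

mutSeconds, gcSeconds, totalSeconds :: Double
mutSeconds   = 14.31
gcSeconds    = 33.09
totalSeconds = 47.41

-- Sparks created per chunk. Evaluates to 2.0.
sparksPerChunk :: Double
sparksPerChunk = fromIntegral sparksCreated / fromIntegral chunks

-- Fraction of sparks that fizzled (just under one half).
fizzledShare :: Double
fizzledShare = fromIntegral sparksFizzled / fromIntegral sparksCreated

-- Fraction of total CPU time spent in the garbage collector (about 0.70).
gcShare :: Double
gcShare = gcSeconds / totalSeconds

-- Fraction of total CPU time spent in the mutator; this is the
-- "Productivity ... of total user" figure (about 0.30).
productivity :: Double
productivity = mutSeconds / totalSeconds
```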

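【Comments on the question】:

Have you established a reasonable baseline timing? How long does it take just to compute the length of the [RGBA]? Since your other comments indicate that the value is being streamed in with lazy IO, it is quite possible that IO time will always dominate whatever processing you do, parallel or not. So how much of the runtime is just IO and parsing?
I can try to see how long the IO and Codec.BMP parsing take. The baseline I am using is the serial implementation, which takes about 10 seconds. I think that is sufficient for comparison against the 30-40 seconds the parallel implementation takes.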
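The baseline suggested in the first comment could be measured with something like the sketch below. The helper names timed and baseline are hypothetical. Note that length forces only the spine of the list; whether that also forces the reading and parsing of each pixel depends on how the list is produced, so the number is a rough lower bound rather than an exact IO cost.

```haskell
import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)

-- Run an action and return its result together with the CPU time it
-- used, in seconds. getCPUTime reports picoseconds of CPU time, not
-- wall-clock time.
timed :: IO a -> IO (a, Double)
timed action = do
  start <- getCPUTime
  result <- action
  end <- getCPUTime
  pure (result, fromIntegral (end - start) / 1e12)

-- Walk the list once without applying any per-pixel transform.
-- evaluate forces the length inside IO, so the traversal happens
-- between the two clock readings.
baseline :: [a] -> IO (Int, Double)
baseline pixels = timed (evaluate (length pixels))
```

Applied to the lazily read pixel list, the second component of the result can be compared against the roughly 10 seconds reported for the full serial run.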

Tags: haskell parallel-processing

【Answer 1】:

You have a big list of RGBA pixels. Why not use parListChunk with a reasonable chunk size?
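For reference, a minimal sketch of that suggestion is given below. The luma is a placeholder (a plain channel average), toGrayChunked is a hypothetical name, and the chunk size is left as a parameter because the answer does not name one. Be aware that parListChunk traverses the entire list to create its sparks before the result is returned, so the whole pixel list is held in memory rather than streamed.

```haskell
import Control.Parallel.Strategies (parListChunk, rdeepseq, withStrategy)
import Data.Word (Word8)

type RGBA = (Word8, Word8, Word8, Word8)

-- Placeholder transform (hypothetical): average the colour channels.
luma :: RGBA -> RGBA
luma (r, g, b, a) = (y, y, y, a)
  where
    y = fromIntegral ((fromIntegral r + fromIntegral g + fromIntegral b) `div` (3 :: Int))

-- Evaluate the mapped list in parallel, n pixels per spark, with each
-- pixel evaluated to normal form. The chunk size n is a tuning
-- parameter; no particular value is recommended here.
toGrayChunked :: Int -> [RGBA] -> [RGBA]
toGrayChunked n = withStrategy (parListChunk n rdeepseq) . map luma
```

For example, toGrayChunked 8000 pixels would use the same chunk size as the question's parBuffer attempt.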

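【Comments】:

This seems more like a comment than an answer; it does not address the OP's problem and only suggests something to try.
parListChunk forces the spine of the [5184 x 3456] image, which takes up a lot of memory. I am trying to avoid that and still use lazy IO.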