Why is pmap|reducers/map not using all CPU cores? Answers

【Question title】: Why is pmap|reducers/map not using all cpu cores?
【Posted】: 2016-09-10 19:20:56
【Question】:

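I'm trying to parse a file with a million lines, where each line is a JSON string containing some information about a book (author, contents, etc.). I'm using iota to load the file, because my program throws an OutOfMemoryError if I try to use slurp. I'm also using cheshire to parse the strings. The program simply loads a file and counts all the words across all the books.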

My first attempt used pmap to do the heavy lifting; I figured this would essentially make use of all my CPU cores.

(ns multicore-parsing.core
  (:require [cheshire.core :as json]
            [iota :as io]
            [clojure.string :as string]
            [clojure.core.reducers :as r]))


(defn words-pmap
  [filename]
  (letfn [(parse-with-keywords [str]
            (json/parse-string str true))
          (words [book]
            (string/split (:contents book) #"\s+"))]
    (->>
     (io/vec filename)
     (pmap parse-with-keywords)
     (pmap words)
     (r/reduce #(apply conj %1 %2) #{})
     (count))))

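Although it does seem to use all the cores, each core rarely goes above 50% of its capacity. My guess is that this has to do with pmap's batch size, and so I stumbled across a relatively old question in which some comments refer to the clojure.core.reducers library.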

I decided to rewrite the function using reducers/map:

(defn words-reducers
  [filename]
  (letfn [(parse-with-keywords [str]
            (json/parse-string str true))
          (words [book]
            (string/split (:contents book) #"\s+"))]
  (->>
   (io/vec filename)
   (r/map parse-with-keywords)
   (r/map words)
   (r/reduce #(apply conj %1 %2) #{})
   (count))))

But the CPU usage is worse, and it even takes longer to finish than the previous implementation:

multicore-parsing.core=> (time (words-pmap "./dummy_data.txt"))
"Elapsed time: 20899.088919 msecs"
546
multicore-parsing.core=> (time (words-reducers "./dummy_data.txt"))
"Elapsed time: 28790.976455 msecs"
546

What am I doing wrong? Is mmap loading + reducers the correct approach for parsing a large file?

EDIT: this is the file I'm using.

EDIT2: Here are the timings with iota/seq instead of iota/vec:

multicore-parsing.core=> (time (words-reducers "./dummy_data.txt"))
"Elapsed time: 160981.224565 msecs"
546
multicore-parsing.core=> (time (words-pmap "./dummy_data.txt"))
"Elapsed time: 160296.482722 msecs"
546

【Comments】:

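It looks like io/vec scans the entire file to build an index of where the lines are. Do you get different results if you try io/seq?
@NathanDavis I just tried it, and the times are worse. Let me update the question.
This talk by Leon Barrett, the author of Claypoole, may have some relevant information. It explains pmap in detail, including why it often doesn't saturate the CPU and why feeding one pmap into another can produce surprising results. Also, if you're mostly looking for a way to saturate your CPU, Claypoole may be exactly what you need.
Not saturating the CPU: that sounds like it is I/O bound. Perhaps using line-seq, which reads lines lazily, would help. Also, don't call pmap twice in a row like that. It's better to use (pmap (comp words parse-with-keywords)). Try to pack as much processing as possible into a single pmap call, because creating multiple threads on every call adds a lot of overhead. If a single pmap call does too little processing, it isn't worth using.
It's usually best to use the Criterium library for timing, though it may not matter in your case.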
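
The suggestion to fuse the two pmap calls can be sketched roughly as follows. This is only an illustration under stated assumptions: it runs on in-memory strings rather than the asker's file, `parse-line` is a hypothetical stand-in that reads EDN instead of calling cheshire, and `count-unique-words` is an illustrative name that does not appear in the question.

```clojure
(ns fused-pmap-sketch
  (:require [clojure.edn :as edn]
            [clojure.string :as string]))

;; Hypothetical stand-in for (json/parse-string line true):
;; each line is assumed to be an EDN map with a :contents key.
(defn parse-line [line]
  (edn/read-string line))

(defn words [book]
  (string/split (:contents book) #"\s+"))

(defn count-unique-words
  "Counts distinct words, doing all per-line work inside one pmap call."
  [lines]
  (->> lines
       (pmap (comp words parse-line)) ; parse and split in the same task
       (reduce into #{})              ; sequentially collect words into one set
       count))

(count-unique-words ["{:contents \"a b c\"}" "{:contents \"b c d\"}"])
;; => 4
```

Each parallel task now parses and splits one line, so there is a single hand-off per line instead of two. Whether that is enough to keep the cores busy depends on how expensive each line is to process.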

Tags: clojure reducers pmap cheshire

【Solution 1】:

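I don't believe reducers are going to be the right solution for you, as they don't cope at all well with lazy sequences (a reducer will give correct results with a lazy sequence, but won't parallelise well).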
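
To make the point about lazy sequences concrete, here is a minimal sketch, assuming in-memory book maps rather than the asker's file (`count-unique-words-fold` and the sample data are hypothetical). In clojure.core.reducers, `r/reduce` is a sequential reduction; the parallel entry point is `r/fold`, and it only splits the work for foldable collections such as vectors. Given a lazy sequence, `r/fold` falls back to an ordinary sequential reduce and still returns the correct result.

```clojure
(ns fold-sketch
  (:require [clojure.core.reducers :as r]
            [clojure.set :as set]
            [clojure.string :as string]))

(defn words [book]
  (string/split (:contents book) #"\s+"))

(defn count-unique-words-fold
  "Counts distinct words with r/fold. Runs in parallel only when
   books is a foldable collection (for example a vector)."
  [books]
  (->> books
       (r/map words)
       (r/fold
        (r/monoid set/union hash-set) ; combinef: merge partial sets, (hash-set) is the seed
        into)                         ; reducef: add one book's words to a partial set
       count))

;; Sample data, made up for illustration.
(def sample-books
  (conj (vec (repeat 2000 {:contents "a b c"})) {:contents "c d"}))

(count-unique-words-fold sample-books)                ; vector: eligible for parallel fold
(count-unique-words-fold (map identity sample-books)) ; lazy seq: sequential fallback
```

Both calls return the same count; only the first can be split across the fork/join pool.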

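You might want to take a look at the sample code from the book Seven Concurrency Models in Seven Weeks (disclaimer: I am the author), which solves a similar problem (counting the number of times each word appears on Wikipedia).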

Given a list of Wikipedia pages, this function counts the words sequentially (get-words returns a sequence of the words in a page):

(defn count-words-sequential [pages]
  (frequencies (mapcat get-words pages)))

Here is a parallel version using pmap, which does run faster, but only around 1.5x faster:

(defn count-words-parallel [pages]
  (reduce (partial merge-with +)
    (pmap #(frequencies (get-words %)) pages)))

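The reason it's only 1.5x faster is that the reduce becomes a bottleneck: it calls (partial merge-with +) once for each page. Merging batches of 100 pages improves the performance to around 3.2x on a 4-core machine: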

(defn count-words [pages]
  (reduce (partial merge-with +)
    (pmap count-words-sequential (partition-all 100 pages))))
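
The same batching idea could be carried over to the question's task of counting distinct words. The following is a rough sketch, not the answer author's code: it assumes the books are already parsed into maps with a :contents key, and `unique-words-sequential` and `count-unique-words-batched` are hypothetical names introduced here.

```clojure
(ns batched-pmap-sketch
  (:require [clojure.set :as set]
            [clojure.string :as string]))

(defn words [book]
  (string/split (:contents book) #"\s+"))

(defn unique-words-sequential
  "Builds the set of distinct words for one batch of books."
  [books]
  (into #{} (mapcat words) books))

(defn count-unique-words-batched
  "Processes books in batches of batch-size; each batch is one pmap task,
   so the final reduce merges one set per batch instead of one per book."
  [batch-size books]
  (->> (partition-all batch-size books)
       (pmap unique-words-sequential)
       (reduce set/union #{})
       count))

(count-unique-words-batched 2 [{:contents "a b"} {:contents "b c"} {:contents "c d"}])
;; => 4
```

The batch size is a tuning knob: larger batches mean fewer merges and less coordination per task, while batches that are too large leave some cores idle near the end of the input.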

【Comments】:

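Is pages a lazy sequence, or are all the pages loaded beforehand?
pages is lazy, yes.
You can see the source code that loads the pages here: media.pragprog.com/titles/pb7con/code/FunctionalProgramming/…, and, for completeness, the implementation of get-words here: media.pragprog.com/titles/pb7con/code/FunctionalProgramming/…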