How to share data between all processes in Python multiprocessing? (Answers)

【Question title】: How to share data between all processes in Python multiprocessing?
【Posted】: 2018-06-18 03:16:21
【Question description】:

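I want to search a given article for a predefined list of keywords, adding 1 to a score whenever a keyword is found in the article. I want to use multiprocessing because the predefined keyword list is very large (10k keywords) and there are 100k articles.

I came across this question, but it did not solve my problem.

I tried this implementation but got None.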

import multiprocessing as mp

keywords = ["threading", "package", "parallelize"]

def search_worker(keyword):
    # score is 1 if the keyword occurs in the article, otherwise 0
    score = 0
    article = """
    The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""

    if keyword in article:
        score += 1
    return score

I tried the following two approaches, but got three None values.

Method 1:

pool = mp.Pool(processes=4)
result = [pool.apply(search_worker, args=(keyword,)) for keyword in keywords]

Method 2:

result = pool.map(search_worker, keywords)
print(result)

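Actual output: [None, None, None]

Expected output: 3

I would like to send both the predefined keyword list and the article to the worker, but I am not sure I am heading in the right direction, as I have no prior experience with multiprocessing.

Thanks in advance.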

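【Comments】:

Why not use ElasticSearch as your search engine?
I am not sure how to do this with ElasticSearch. I want to compute a confidence score for each article based on the keyword list, and index the articles along with their confidence scores.
ElasticSearch can do that easily! You should really give it a try.
There are different solutions for your case. For one, you could use shared storage, such as a database. Redis is really simple and works well. Depending on your scale and how complex your plans are, consider some map-reduce technique.
Your code works fine when I run it (python3.5). (I get [1, 1, 1]; you just need a global count or to sum the results.) Did you remember to run method 1 and method 2 under if __name__ == '__main__'?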

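Tags: python python-3.x python-2.7 multiprocessing python-multiprocessing

【Solution 1】:

User e.s resolved the main problem in their comment, but I am posting a solution to Om Prakash's comment asking how to pass in:

both the article and the predefined list of keywords to the worker method

Here is a simple way to do that. All you need to do is construct a tuple containing the arguments you want the worker to process: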

from multiprocessing import Pool

def search_worker(article_and_keyword):
    # unpack the tuple
    article, keyword = article_and_keyword

    # count occurrences
    score = 0
    if keyword in article:
        score += 1

    return score

if __name__ == "__main__":
    # the article and the keywords
    article = """The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    keywords = ["threading", "package", "parallelize"]

    # construct the arguments for the search_worker; one keyword per worker but same article
    args = [(article, keyword) for keyword in keywords]

    # construct the pool and map to the workers
    with Pool(3) as pool:
        result = pool.map(search_worker, args)
    print(result)

If you are on a later version of Python (3.3 or newer), I would suggest trying starmap, as it makes this a bit cleaner.
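As a rough illustration of the starmap suggestion, the worker can take the article and the keyword as two separate arguments, so no manual tuple unpacking is needed. This is only a sketch: the helper name count_keywords and its processes parameter are made up for this example and are not part of the answer above.

```python
from itertools import repeat
from multiprocessing import Pool


def search_worker(article, keyword):
    # Return 1 if the keyword occurs in the article, otherwise 0.
    return int(keyword in article)


def count_keywords(article, keywords, processes=3):
    # Hypothetical helper: pair the same article with every keyword and
    # let starmap unpack each (article, keyword) pair into the worker.
    with Pool(processes) as pool:
        scores = pool.starmap(search_worker, zip(repeat(article), keywords))
    return sum(scores)


if __name__ == "__main__":
    article = """The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    keywords = ["threading", "package", "parallelize"]
    print(count_keywords(article, keywords))
```

Summing the per-keyword scores yields the single total the question asked for (3 for this article and keyword list) rather than a list of ones.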

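【Discussion】:

【Solution 2】:

Here is a function that uses Pool. You can pass in a text and a keyword_list and it will work. You could use Pool.starmap to pass tuples of (text, keyword), but you would then have to deal with an iterable holding 10k references to text.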

from functools import partial
from multiprocessing import Pool

def search_worker(text, keyword):
    return int(keyword in text)

def parallel_search_text(text, keyword_list):
    processes = 4
    chunk_size = 10
    total = 0
    func = partial(search_worker, text)
    with Pool(processes=processes) as pool:
        for result in pool.imap_unordered(func, keyword_list, chunksize=chunk_size):
            total += result

    return total

if __name__ == '__main__':
    texts = []  # a list of texts
    keywords = []  # a list of keywords
    for text in texts:
        print(parallel_search_text(text, keywords))
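For comparison with the pool version above, a plain single-process count makes a useful baseline. The sketch below is an assumption about what such a baseline might look like; serial_search_text is a hypothetical name, and the sample text and keywords are placeholders.

```python
import time


def serial_search_text(text, keyword_list):
    # Count the keywords that occur in the text, all in the current process.
    return sum(keyword in text for keyword in keyword_list)


if __name__ == "__main__":
    text = "The multiprocessing package includes APIs that are not in the threading module."
    keywords = ["threading", "package", "parallelize"]

    # Time the serial version so it can be compared with the pool version.
    start = time.perf_counter()
    total = serial_search_text(text, keywords)
    elapsed = time.perf_counter() - start
    print(total, "matches in", f"{elapsed:.6f}", "seconds")
```

Timing the pool version the same way on real data shows whether the parallel version actually pays for its startup and pickling costs.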

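Creating a pool of workers has overhead. It may be worth timing this against a simple single-process text search function. Repeated calls can be sped up by creating a single Pool instance and passing it to the function.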

def parallel_search_text2(text, keyword_list, pool):
    chunk_size = 10
    results = 0
    func = partial(search_worker, text)

    for result in pool.imap_unordered(func, keyword_list, chunksize=chunk_size):
        results += result
    return results

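if __name__ == '__main__':
    texts = []  # a list of texts
    keywords = []  # a list of keywords
    # create the pool once and reuse it for every text; the with-block
    # shuts the workers down once the loop is done
    with Pool(processes=4) as pool:
        for text in texts:
            print(parallel_search_text2(text, keywords, pool))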

【Discussion】: