PySpark - top-n words from multiple files: answers

【Question title】: PySpark - top-n words from multiple files
【Posted】: 2017-09-12 18:20:56
【Question description】:

I have a Python dictionary:

diction = {'1.csv': 'this is is a test test test ', '2.txt': 'that that was a test test test'}

I have created an RDD from it like this:

docNameToText = sc.parallelize(diction)

I need to find the top 2 most frequent strings in each document. So the result should look like this:

1.csv, test, is
2.txt, test, that

I am new to PySpark. I know the algorithm, but not how to express it in PySpark. I need to:

- convert the file-to-string => file-to-wordFreq
- arrange wordFreq in non-increasing order - if two words have the same freq, arrange them in alphabetical order
- display the top 2

How can I achieve this?

【Question comments】:

Tags: python apache-spark pyspark spark-streaming

【Solution 1】:

Just use Counter:

from collections import Counter

(sc
    .parallelize(diction.items())
    # Split on whitespace
    .mapValues(lambda s: s.split())
    # Count occurrences of each word
    .mapValues(Counter)
    # Keep the two most common words
    .mapValues(lambda c: [x for (x, _) in c.most_common(2)]))

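【Comments】:

Thanks! A small clarification: what if two words have the same count and I also want the results ordered alphabetically? For example, for '1.csv', the result should be ['test', 'that'] instead of ['that', 'test']. Thanks.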
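
One possible way to get the alphabetical tie-break asked about above is to sort the counts explicitly instead of relying on most_common, whose order for equal counts follows insertion order rather than the alphabet. The sketch below is not part of the original answer: the helper name top_n_words and its parameter n are made up for illustration, and the Spark call is left as a comment because it assumes a live SparkContext named sc, as in the question.

```python
from collections import Counter


def top_n_words(text, n=2):
    """Return the n most frequent words; equal counts are ordered alphabetically."""
    counts = Counter(text.split())
    # Sort by descending count first, then by the word itself.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


diction = {
    '1.csv': 'this is is a test test test ',
    '2.txt': 'that that was a test test test',
}

# Plain-Python check of the helper (hypothetical name), no Spark required.
local_result = {name: top_n_words(text) for name, text in diction.items()}

# Assuming a SparkContext `sc` exists, the same helper could be applied
# per document in place of the Counter-based mapValues chain:
# sc.parallelize(diction.items()).mapValues(top_n_words).collect()
```

Because the sort key is (-count, word), a higher count always wins and the word only matters when counts are equal, which matches the ordering described in the question.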