【Posted】: 2017-09-12 18:20:56
【Question】:
I have a Python dictionary:
diction = {'1.csv': 'this is is a test test test ', '2.txt': 'that that was a test test test'}
I created an RDD from it like this:
docNameToText = sc.parallelize(diction)
I need to find the 2 most frequent strings in each document. So the result should look like this:
1.csv, test, is
2.txt, test, that
I am new to pyspark. I know the algorithm, but not how to express it in pyspark. I need to:
- convert the file-to-string mapping into a file-to-wordFreq mapping
- sort wordFreq in non-increasing order of frequency; if two words have the same frequency, sort them alphabetically
- display the top 2 words
How can I implement this?
【Discussion】:
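One possible way to express the three steps from the question is sketched below. It is only a sketch under a few assumptions: `top_words` and `top_words_per_doc` are hypothetical helper names, not part of pyspark, and `sc` is assumed to be an already running SparkContext like the one in the question. Note also that `sc.parallelize(diction)` on a dict would most likely distribute only the keys (the file names), so the sketch passes the (name, text) pairs instead.

```python
from collections import Counter


def top_words(text, n=2):
    """Return the n most frequent words in text.

    Ties are broken alphabetically, as the question asks.
    """
    # split() with no argument also ignores leading/trailing spaces.
    counts = Counter(text.split())
    # Sort by descending count first, then by word in ascending order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def top_words_per_doc(sc, docs, n=2):
    """Build an RDD of (file name, top-n words) pairs from a dict.

    `sc` is assumed to be an existing SparkContext.
    """
    # Iterating a dict yields its keys only, so hand Spark
    # the (name, text) pairs explicitly.
    pairs = sc.parallelize(list(docs.items()))
    # mapValues keeps the file name as key and transforms only the text.
    return pairs.mapValues(lambda text: top_words(text, n))
```

With the dictionary from the question, `top_words(diction['1.csv'])` gives `['test', 'is']` and `top_words(diction['2.txt'])` gives `['test', 'that']`. On a live cluster, `top_words_per_doc(sc, diction).collect()` should return those lists paired with their file names. Doing the counting inside `mapValues` assumes each document fits comfortably in memory on one executor; for very large documents a `flatMap` plus `reduceByKey` over ((name, word), 1) pairs might be the better fit.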
Tags: python apache-spark pyspark spark-streaming