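【Posted】: 2019-10-27 19:37:47
【Question】:
I wrote a custom function that finds the most frequently occurring words in a .txt file. I need to run it through PySpark as an RDD.
The function is called top_five, and its only parameter is file_name.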
import collections

def top_five(file_name):
    list1 = []
    # Open the file in a with-block so it is closed automatically
    with open(file_name, 'r', encoding='utf8') as file:
        for line in file:
            print(line)  # echoes every line read (debug output)
            words = line.split()
            for i in words:
                # Keep only alphabetic characters, then lowercase the word
                j = ''.join(filter(str.isalpha, i))
                j = j.lower()
                # Only count words longer than 5 letters
                if len(j) > 5:
                    list1.append(j)
    count = collections.Counter(list1)
    most_occur = count.most_common(5)
    print("The most used words in the Applied Data Science Textbook are:")
    for item in most_occur:
        print("\t" + item[0] + " occurred " + str(item[1]) + " times")
    return
The actual result needs to be what the last 3 lines of top_five produce: each word printed together with its number of occurrences.
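For reference, here is a minimal sketch of how the same logic might be expressed with RDD operations. It is not a tested solution for this exact setup. The names clean_words, top_words_rdd and top_five_spark are made up for illustration, and it assumes PySpark is installed and that file_name is a path Spark can read. The word cleaning mirrors top_five (letters only, lowercased, longer than 5 characters), and the counting is done with reduceByKey instead of collections.Counter.

```python
from operator import add


def clean_words(line, min_len=6):
    """Split a line into words, keep letters only, lowercase,
    and drop words shorter than min_len (same rule as len(j) > 5)."""
    cleaned = (''.join(filter(str.isalpha, token)).lower() for token in line.split())
    return [word for word in cleaned if len(word) >= min_len]


def top_words_rdd(lines_rdd, n=5):
    """Return the n most frequent cleaned words as (word, count) pairs."""
    return (lines_rdd
            .flatMap(clean_words)                       # one record per cleaned word
            .map(lambda word: (word, 1))                # pair each word with a count of 1
            .reduceByKey(add)                           # sum the counts per word
            .takeOrdered(n, key=lambda kv: -kv[1]))     # n largest counts, returned to the driver


def top_five_spark(file_name):
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    most_occur = top_words_rdd(sc.textFile(file_name), 5)
    print("The most used words in the Applied Data Science Textbook are:")
    for word, count in most_occur:
        print("\t" + word + " occurred " + str(count) + " times")
```

Because takeOrdered brings the results back to the driver, the final print loop runs on the driver and its output shows up in the console. Words with equal counts may come back in any order.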
【Discussion】:
Tags: python apache-spark pyspark custom-function