[Posted]: 2018-10-22 16:54:12
[Question]:
I am trying to join the results I get from two MapReduce jobs. The first job returns the 5 most influential papers. Below is the code for the first reducer.
import sys

current_word = None
current_count = 0
word = None
topFive = {}

# input comes from stdin, sorted by key (paper id) by the shuffle phase
for line in sys.stdin:
    line = line.strip()
    # parse the input we got from mapper.py: key<TAB>value
    word, check = line.split('\t', 1)
    # every input line counts as one citation of this paper
    count = 1
    if current_word == word:
        current_count += count
    else:
        # the key changed, so store the finished total for the previous key
        if current_word:
            topFive[current_word] = current_count
        current_count = count
        current_word = word

# store the total for the last key as well
if current_word:
    topFive[current_word] = current_count

# take six candidates so that five remain if 'nan' is among them
t = sorted(topFive.items(), key=lambda x: -x[1])[:6]
print("Top five most cited papers")
count = 1
for x in t:
    if x[0] != 'nan' and count <= 5:
        print("{0}: {1}".format(*x))
        count = count + 1
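The second job finds the 5 most influential authors, and its code is almost the same as the code above. I want to join the results of these two jobs so that, for each author, I can determine the average number of citations of their 3 most influential papers. I don't know how to do this. It seems I need to join the results somehow?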
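A minimal sketch of how such a join might look, written as plain Python functions that mimic a reduce-side join rather than as actual Hadoop streaming scripts. It assumes two hypothetical tab-separated inputs that are not shown in the question: citation counts as `paper<TAB>count` and authorship records as `author<TAB>paper`. The function names (`tag_records`, `join_by_paper`, `average_top_citations`) and the `"C"`/`"A"` tags are made up for illustration, and `sorted` stands in for the shuffle phase.

```python
import heapq
from collections import defaultdict
from itertools import groupby


def tag_records(citation_lines, authorship_lines):
    """Map step: key both inputs by paper id and tag where each record came from."""
    for line in citation_lines:        # assumed format: paper<TAB>count
        paper, count = line.rstrip("\n").split("\t")
        yield paper, "C", count
    for line in authorship_lines:      # assumed format: author<TAB>paper
        author, paper = line.rstrip("\n").split("\t")
        yield paper, "A", author


def join_by_paper(tagged):
    """Reduce step: pair each paper's citation count with each of its authors."""
    # sorted() plays the role of the shuffle, grouping records by paper id
    for paper, group in groupby(sorted(tagged), key=lambda r: r[0]):
        count, authors = None, []
        for _, tag, value in group:
            if tag == "C":
                count = int(value)
            else:
                authors.append(value)
        if count is not None:          # skip papers with no citation record
            for author in authors:
                yield author, paper, count


def average_top_citations(joined, top_n=3):
    """Second reduce: average the top_n citation counts for each author."""
    per_author = defaultdict(list)
    for author, _paper, count in joined:
        per_author[author].append(count)
    return {
        author: sum(heapq.nlargest(top_n, counts)) / float(min(top_n, len(counts)))
        for author, counts in per_author.items()
    }


if __name__ == "__main__":
    citations = ["p1\t10", "p2\t4", "p3\t7", "p4\t1"]
    authorship = ["alice\tp1", "alice\tp2", "alice\tp3", "alice\tp4", "bob\tp2"]
    joined = join_by_paper(tag_records(citations, authorship))
    print(average_top_citations(joined))
```

In a real streaming setup the same idea would probably be split into two jobs: the first joins on the paper id, and the second groups the joined records by author. Restricting the output to the 5 most influential authors could then be a filter on the second job's result.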
[Discussion]:
Tags: python hadoop mapreduce cloudera