MapReduce Reducer of 2 Keys - Python: Answers

【Question title】: MapReduce Reducer of 2 Keys - Python
【Posted】: 2019-02-25 08:20:21
【Question description】:

This should be simple, but I have already spent hours on it.

Sample data (name, binary, count):

Adam 0 1
Adam 1 1
Adam 0 1
Mike 1 1
Mike 0 1
Mike 1 1

Desired sample output (name, binary, count):

Adam 0 2
Adam 1 1
Mike 0 1
Mike 1 2

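Each name needs its own binary key, 0 or 1. The count column is summed per binary key. Note the "reduction" in the desired output.

I have included some of my code; I am trying not to use lists or dictionaries in the reducer.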

""" Reducer to add up names with their binaries and partial counts

INPUT: name \t binary \t pCount

OUTPUT: name \t binary \t tCount
"""

import re
import sys

current_name = None
zero_count, one_count = 0,0

for line in sys.stdin:
    # parse the input
    name, binary, count = line.split('\t')

   if name == current_name:
      if int(binary) == 0:
        zero_count += int(count)

    elif int(binary) == 1:
        one_count += int(count)
else:
    if current_name:
        print(f'{current_name}\t{0} \t{zero_count}')
        print(f'{current_name}\t{1} \t{one_count}')
    current_name, binary, count = word, int(binary), int(count)

print(f'{current_name}\t{1} \t{count}')

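For some reason, it is not printing correctly (the names that come through are funky). I am also not sure of the best way to print both one_count and zero_count together with their binary labels.

Any help would be greatly appreciated. Thanks!

【Question discussion】:

Tags: python hadoop mapreduce hadoop-streaming reducers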

【Solution 1】:

I think it is better to use the pandas library.

import pandas as pd
from io import StringIO
a ="""Adam 0 1
Adam 1 1
Adam 0 1
Mike 1 1
Mike 0 1
Mike 1 1"""

text = StringIO(a)
name, binary, count = [],[],[]

for line in text.readlines():
    parts = line.strip().split(" ")  # [name, binary, count]
    name.append(parts[0])
    binary.append(parts[1])
    count.append(parts[2])

df = pd.DataFrame({'name': name, "binary": binary, "count": count})
df['count'] = df['count'].astype(int)
df = df.groupby(['name', 'binary'])['count'].sum().reset_index()
print(df)

Output:

   name binary  count
0  Adam      0      2
1  Adam      1      1
2  Mike      0      1
3  Mike      1      2

If your data is already in a CSV or text file, you can read it with pandas.

df = pd.read_csv('path to your file')
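
For the space-separated, header-less sample in the question, a bare read_csv call would treat the first row as a header and put each row into a single column. A minimal sketch, assuming single-space separators and no header row; the StringIO object is only a stand-in for a real file path, and the column names are the ones used above:

```python
import pandas as pd
from io import StringIO

# Stand-in for a real file: the sample rows from the question.
raw = StringIO("Adam 0 1\nAdam 1 1\nAdam 0 1\nMike 1 1\nMike 0 1\nMike 1 1\n")

# sep=" " and header=None are assumptions about the file layout.
df = pd.read_csv(raw, sep=" ", header=None, names=["name", "binary", "count"])

# Sum the count column for every (name, binary) pair.
totals = df.groupby(["name", "binary"])["count"].sum().reset_index()
print(totals)
```

With a real file, the path would replace the StringIO object; a tab-separated file would need sep="\t" instead.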

【Discussion】:

I think the OP wants to use Hadoop, though.

【Solution 2】:

The indentation is bad and the conditions are not handled properly.

import re
import sys

current_name = None
zero_count, one_count = 0,0
i = 0
for line in sys.stdin:
    # parse the input
    name, binary, count = line.split('\t')
    #print(name)
    #print(current_name)
    if(i == 0):
        current_name = name
        i  = i + 1
    if(name == current_name):
        if int(binary) == 0:
            zero_count += int(count)

        elif int(binary) == 1:
            one_count += int(count)
    else:
        print(f'{current_name}\t{0} \t{zero_count}')
        print(f'{current_name}\t{1} \t{one_count}')
        current_name = name
        #print(current_name)
        zero_count, one_count = 0,0
        if int(binary) == 0:
            zero_count += int(count)
        elif int(binary) == 1:
            one_count += int(count)
print(f'{current_name}\t{0} \t{zero_count}')
print(f'{current_name}\t{1} \t{one_count}')

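'i' handles the first line of input, when there is no 'current_name' yet (that branch runs only once).
In the else block, 'zero_count' and 'one_count' are re-initialized and counting starts for the new 'current_name'.

Output of my code: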

Adam    0       2
Adam    1       1
Mike    0       1
Mike    1       2
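
An alternative is to treat (name, binary) as one composite key, so the reducer needs only a single running total and no list or dictionary. The sketch below is not the answerer's code. It assumes the reducer input is sorted by both name and binary, which a streaming job does not guarantee when only the first field is used as the key. The helper names parse and reduce_sorted are made up for this illustration. Unlike the code above, it emits only the (name, binary) pairs that actually occur.

```python
from itertools import groupby


def parse(lines):
    """Yield (name, binary, count) tuples from tab-separated lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, binary, count = line.split('\t')
        yield name, int(binary), int(count)


def reduce_sorted(lines):
    """Sum counts per (name, binary).

    Assumption: lines arrive sorted by name and then by binary, so equal
    keys are adjacent and groupby can stream over them.
    """
    records = parse(lines)
    for (name, binary), group in groupby(records, key=lambda rec: (rec[0], rec[1])):
        yield name, binary, sum(count for _, _, count in group)


# Local stand-in for the reducer input; in a streaming job the lines
# would come from sys.stdin instead of this list.
sample = [
    "Adam\t0\t1", "Adam\t1\t1", "Adam\t0\t1",
    "Mike\t1\t1", "Mike\t0\t1", "Mike\t1\t1",
]

# sorted() imitates the shuffle/sort step for this small sample.
for name, binary, total in reduce_sorted(sorted(sample)):
    print(f'{name}\t{binary}\t{total}')
```

If only the name is used as the sort key, the two-counter approach shown above is the safer choice, because it does not depend on the order of the binary values within a name.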

【Discussion】: