Regular expressions in python map reduce: Counting words with «ñ» and accented vowels

【Question title】: Regular expressions in python map reduce: Counting words with «ñ» and accented vowels
【Posted】: 2015-12-06 08:57:56
【Question】:

I am using a regular expression to handle accented vowels and «ñ» in Spanish text, like this:

WORD_REGEXP = re.compile(r"[a-zA-Záéíóúñ]+")

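It works on any plain string, but when I run my map reduce program it does not handle accented Spanish words such as «acción» correctly, and the word appears truncated in the output file. There is a line like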

acci: 6

instead of:

acción: 6

Here is the Python code. Any suggestions? Thanks.

# -*- coding: utf-8 -*-
from mrjob.job import MRJob
import re

WORD_REGEXP = re.compile(r"[a-zA-Záéíóúñ]+")

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        words = WORD_REGEXP.findall(line)
        for word in words:
            yield word.lower(), 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()

【Comments】:

Hmm... WORD_REGEXP.findall(line) gives me ['acci', 'instead', 'of', 'acción']. Isn't that correct? What is the expected output?
The expected output would be the complete key: «acción» instead of «acci»

Tags: python regex mapreduce mrjob

【Solution 1】:

This looks like an encoding problem.
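One way such a truncation can arise is sketched below in Python 3. It assumes (this is a guess, not something the question confirms) that the pattern's UTF-8 bytes end up being compared one byte per character against properly decoded text, which is simulated here by re-reading the pattern as Latin-1. The names `mis_decoded_pattern`, `broken` and `correct` are made up for the illustration.

```python
import re

text = "acción"

# Assumption for illustration: the UTF-8 bytes of the pattern are read as
# one character per byte (Latin-1), so "ó" becomes the two characters "Ã³".
mis_decoded_pattern = "[a-zA-Záéíóúñ]+".encode("utf-8").decode("latin-1")

broken = re.compile(mis_decoded_pattern)
correct = re.compile("[a-zA-Záéíóúñ]+")

# The mis-decoded class no longer contains "ó", so the word is split there.
broken_words = broken.findall(text)

# With pattern and text in the same representation, the word stays whole.
correct_words = correct.findall(text)

print(broken_words)
print(correct_words)
```

If this is what happens in the job, the first fragment of the split word is exactly the «acci» key seen in the output file.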

The documentation suggests using BytesValueProtocol to enforce the encoding.

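from mrjob.job import MRJob
from mrjob.protocol import BytesValueProtocol

class MREncodingEnforcer(MRJob):

    # Hand each input line to the mapper as raw bytes
    INPUT_PROTOCOL = BytesValueProtocol

    def mapper(self, _, value):
        # Decode explicitly so the rest of the mapper works on Unicode text
        value = value.decode('utf_8')
        ...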
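The decode-then-match step can be tried without a cluster. The following is a minimal Python 3 sketch that uses only the standard library and stands in for the mapper and reducer; `count_words` and `sample` are hypothetical names, and the input is assumed to be UTF-8 encoded bytes, as it would be with BytesValueProtocol.

```python
import re
from collections import Counter

WORD_REGEXP = re.compile(r"[a-zA-Záéíóúñ]+")

def count_words(raw_lines):
    """Count lowercase words in an iterable of UTF-8 encoded byte strings."""
    counts = Counter()
    for raw in raw_lines:
        # Decode first, so the regex runs on text rather than on raw bytes
        line = raw.decode("utf_8")
        for word in WORD_REGEXP.findall(line):
            counts[word.lower()] += 1
    return counts

# Hypothetical sample input: two lines as they would arrive in bytes form
sample = ["la acción sigue".encode("utf_8"), "otra acción".encode("utf_8")]
result = count_words(sample)
print(result["acción"])
```

Note that the character class from the question lists only lowercase accented letters, so an uppercase «Ó» or «Ñ» would still split a word unless those letters are added to the class.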

【Comments】: