Encoding error while using np.matrix and np.chararray

【Question title】: Encoding error while using np.matrix and np.chararray
【Posted】: 2015-08-22 16:52:22
【Question】:

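I am building a Flask website in Spanish that lets people send encoded messages by email. Basically, you paste text into a text field and it returns an encoded version of it. The encode() and decode() functions below work fine until they have to handle accented letters and other non-ASCII characters. My default system encoding is 'ascii', and I suspect the problem comes from my use of numpy.matrix and numpy.chararray, which may be changing the encoding of my strings.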

When I build the code in Sublime Text 2 and run the test below, I get:

SyntaxError: Non-ASCII character '\xc3'... but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details

When I add

#!/usr/bin/env python
#-*- coding: utf-8 -*-

to the code, it runs in ST2, but it also prints an error, and the decoded message is missing some characters, as shown below:

[Decode error - output not utf-8]

La cr  a del le  n tiene dos a  os.

When I run it on a local server with Flask, I get:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 0: ordinal not in range(128)

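I tried the chardet package, and the items in the matrix were detected as "windows-1252". I decoded the items in the matrix using both "windows-1252" and "cp1252", but the problem persisted. I also tried encoding with "utf-8" after that decoding step (i.e. after decoding with "windows-1252"), but that did not work either. I suspect this is an encoding problem, but I am not entirely sure. Any pointers on how to solve it would be greatly appreciated.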

Here is the code:

import numpy as np
import random, string, re

def encode(message, size, token):
        """Assumes message is a string, size is the size limit of the message,
        and token is a string with unique characters, i.e. bufalo but not rana"""

        message = list(message)

        while len(message) < size:
            sgn = random.choice(['*', '?', '&', '@'])
            message.append(sgn)

        matrix = np.matrix(message)
        cols = size/5

        matrix = matrix.reshape((cols, 5)).T
        encoded = np.chararray(shape=(cols,5)).T

        token = token.lower()
        token = list(token)
        new = []
        for i in token:
            new.append(sorted(token).index(i))

        while len(new) > 5:
            for i in new:
                if i >= (5):
                    new.remove(i)

        old = range(0,5)

        for o, n in zip(old, new):
            encoded[np.ix_([n], range(0, matrix.shape[1]))] = matrix[np.ix_([o], range(0, matrix.shape[1]))]

        encoded_str = ''
        for i in range((encoded.size)):
            encoded_str += encoded.item(i)

        return encoded_str

#########################################
#THIS IS A TEST
#########################################
mssg = "La cría del león tiene dos años."
print encode(mssg, 120, 'bufalo')
#########################################

def decode(message, size, token):
        message = list(message)

        while len(message) < size:
            sgn = random.choice(['*', '?', '&', '@'])
            message.append(sgn)

        matrix = np.matrix(message)

        cols = size/5
        matrix = matrix.reshape((5, cols))

        token = token.lower()
        token = list(token)
        new = []
        for i in token:
            new.append(sorted(token).index(i))
        while len(new) > 5:
            for i in new:
                if i >= (5):
                    new.remove(i)
        old = range(0,5)

        decoded = np.chararray(shape=(cols,5)).T
        for n, o in zip(old, new):
            decoded[np.ix_([n], range(0, matrix.shape[1]))] = matrix[np.ix_([o], range(0, matrix.shape[1]))]

        decoded =decoded.T

        decoded_str = ''
        for i in range((decoded.size)):
            decoded_str += decoded.item(i)

        decoded_str = re.sub('[^a-zA-Z0-9\n\.]', ' ', decoded_str)
        return decoded_str
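
For readers following the code above, the step that turns the token into a row order can be isolated as below. This is only a sketch in Python 3 (the question uses Python 2). The name token_order is hypothetical, and the list comprehension stands in for the original while/remove() loop, which should give the same result for tokens with unique letters.

```python
def token_order(token, rows=5):
    """Rank each letter of the token alphabetically, keeping ranks below `rows`."""
    letters = list(token.lower())
    ranked = sorted(letters)
    # Position of each letter in the sorted token, e.g. 'b' is second in 'abflou'.
    ranks = [ranked.index(ch) for ch in letters]
    # Keep only ranks that name one of the `rows` rows of the matrix.
    return [r for r in ranks if r < rows]


print(token_order("bufalo"))  # [1, 2, 0, 3, 4]
print(token_order("rana"))    # [3, 0, 2, 0]
```

The second call shows why the docstring asks for unique characters: "rana" repeats the letter a, so two rows would map to the same target row.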

【Comments】:

Maybe try defining your string as a unicode literal, e.g. u'La cría del león tiene dos años.'; that might help.

Tags: python encoding utf-8

【Solution 1】:

A few things are needed to fix the code:

1) Since your source file contains non-ASCII characters, it makes sense to add #-*- coding: utf-8 -*- at the top.

2) The test string should be a unicode string, so that line should become

mssg = u"La cría del león tiene dos años."

3) The encoded array (from the line encoded = np.chararray(shape=(cols,5)).T) holds ASCII byte strings by default. You should change the line to

encoded = np.chararray(shape=(cols,5), unicode=True).T

i.e. you need to add the argument unicode=True

The code then runs and prints this result

lt a?@&*@*&&&*&*&*?&?*Lílnnss&*@&&*@&??&?&@**?aa  e .@?*@&&@?@?*@?@?&?cdeidñ*&??&?**@*@*@&*&?@reóeoo&**&?@?&&??&@@??&&
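
To illustrate points 2) and 3), here is a minimal sketch in Python 3, where str plays the role of a Python 2 u"..." literal and bytes the role of a plain "..." literal. It uses ordinary NumPy arrays with the "U1" and "S1" dtypes as a stand-in for np.chararray with and without unicode=True. That substitution is an assumption made for the example, not the code from the answer.

```python
import numpy as np

text = "cría"                   # text string, like u"cría" in Python 2
raw = text.encode("utf-8")      # byte string, like plain "cría" in Python 2

# The text has one item per character; the UTF-8 bytes need two for 'í'.
print(len(text), len(raw))      # 4 5

# A unicode dtype stores each character intact.
chars = np.array(list(text), dtype="U1")
print("".join(chars.tolist()))  # cría

# A byte-string dtype implicitly encodes with ASCII and rejects 'í'.
try:
    np.array(list(text), dtype="S1")
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print(exc)
```

The failure in the last step is the same kind of implicit ASCII encoding that produces the UnicodeEncodeError reported under Flask in the question.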

【Comments】:

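Thanks, that was helpful. I was able to fix the problem in Sublime Text 2 by making the two suggested changes and adding decoded_str = decoded_str.encode("utf-8") just before the end of decode(). I also had to get rid of decoded_str = re.sub('[^a-zA-Z0-9\n\.]', ' ', decoded_str) and instead use the string replace method to remove ['*', '?', '&', '@'] (after the utf-8 encoding). To get it working on Flask, I removed the final utf-8 encoding and kept the new replace approach, because I was getting a UnicodeDecodeError.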
Good to know! If my answer solved your problem, please accept it.
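Do you know whether the reason this works in Flask but not in ST2 has to do with the two using different default encodings? I believe decoded_str = re.sub('[^a-zA-Z0-9\n\.]', ' ', decoded_str) was stripping the non-ASCII characters, which is why the accented letters and other special characters were missing when printing.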
Sorry, I'm not a Sublime Text expert.
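
The suspicion about re.sub in the comment above can be checked with a short sketch. It is written for Python 3, so the Python 2 byte string is emulated by encoding the message to UTF-8 first, which is an assumption about how the original string was stored. The padded and cleaned variables are hypothetical.

```python
import re

mssg = "La cría del león tiene dos años."
pattern = r"[^a-zA-Z0-9\n.]"

# On UTF-8 bytes, each accented letter is two non-ASCII bytes,
# so the substitution turns it into two spaces.
as_bytes = re.sub(pattern.encode("ascii"), b" ", mssg.encode("utf-8"))
print(as_bytes.decode("ascii"))  # La cr  a del le  n tiene dos a  os.

# On text, the ASCII-only class still drops each accented letter,
# but leaves a single space.
as_text = re.sub(pattern, " ", mssg)
print(as_text)                   # La cr a del le n tiene dos a os.

# Removing only the padding symbols keeps the accents.
padded = mssg + "*?&@"
cleaned = padded
for sgn in ["*", "?", "&", "@"]:
    cleaned = cleaned.replace(sgn, "")
print(cleaned == mssg)           # True
```

The byte-string result has the same two-space gaps as the output in the question, which is consistent with the regular expression, not the editor's default encoding, removing the accented letters.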