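Treat all Unicode characters as single letters: answers

【Question title】: Treat all Unicode characters as single letters
【Posted】: 2015-12-17 17:36:59
【Question】:

I want to write a program that computes a word's "value" by summing the values assigned to its letters, where each distinct letter is numbered by the order in which it first appears in the word (this is an exercise; I'm new to Python).
For example, "foo" returns 5 (since 'f' = 1, 'o' = 2) and "bar" returns 6 (since 'b' = 1, 'a' = 2, 'r' = 3).

This is my code so far: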

# -*- coding: utf-8 -*-
def ppn(word):
    word = list(word)
    cipher = dict()
    i = 1
    e = 0

    for letter in word:
        if letter not in cipher:
            cipher[letter] = i
            e += i
            i += 1
        else:
            e += cipher[letter]
    return ''.join(word) + ": " + str(e)


if __name__ == "__main__":
    print ppn(str(raw_input()))

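It works fine, but for words containing characters such as 'ł' or 'ą' it does not return the correct value (I guess because those letters are first converted to their Unicode codes). Is there a way around this, so that the interpreter treats every letter as a single letter?

【Question discussion】:

Tags: python python-2.7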

【Solution 1】:

Decode your input to unicode, use unicode everywhere inside the program, and encode on output.

Specifically, you need to change

print ppn(str(raw_input()))

to

print ppn(raw_input().decode(sys.stdin.encoding))

This decodes your input (it requires import sys at the top of the script). You then also need to change

''.join(word) + ": " + str(e)

to

u''.join(word) + u': ' + unicode(e)

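This makes all of your code use unicode objects internally.

print will correctly encode the unicode to whatever encoding your terminal uses, but you can also specify the encoding explicitly if needed.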
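As a rough illustration of specifying the output encoding explicitly, the sketch below encodes a unicode result by hand instead of leaving it to print. The helper name encode_for_output is hypothetical, and both the UTF-8 fallback and the errors="replace" policy are assumptions made for this example rather than anything the answer prescribes. It sticks to constructs that work on Python 2.7 and Python 3.

```python
import sys


def encode_for_output(text, encoding=None, errors="replace"):
    """Encode a unicode string to bytes for output.

    The encoding is, in order of preference: the `encoding` argument,
    sys.stdout.encoding, then UTF-8 (an assumed fallback).
    Characters the target encoding cannot represent become '?'
    because of errors="replace" (also an assumption of this sketch).
    """
    chosen = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(chosen, errors)


# u"\u0142" is the letter "ł"; its UTF-8 encoding is the two bytes C5 82.
utf8_bytes = encode_for_output(u"\u0142: 1", "utf-8")
ascii_bytes = encode_for_output(u"\u0142: 1", "ascii")
```

On Python 2, the returned byte string can be handed straight to print; the ASCII call shows what happens when the chosen encoding has no representation for a letter.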

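Alternatively, you can keep the logic exactly as it is and run it with Python 3 (where print is a function and raw_input() is called input()).

For more information, see this very useful talk on the subject

【Discussion】:

Thanks for the solution and the additional information. I'll definitely look into it!

【Solution 2】:

Decode using your shell's encoding: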

if __name__ == "__main__":
    import sys
    print ppn((raw_input()).decode(sys.stdin.encoding))

On Unix systems, UTF-8 usually works. On Windows, things may differ. To be safe, use sys.stdin.encoding; you never know where your script will run.
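One caveat worth sketching: on Python 2, sys.stdin.encoding can be None when standard input is not a terminal (for example, when the input is piped in), and then there is nothing to pass to decode. The helper below is a hypothetical wrapper, not part of the answer; the name decode_input and the choice of UTF-8 as the fallback are assumptions.

```python
import sys


def decode_input(raw, encoding=None, fallback="utf-8"):
    """Return `raw` as a unicode string.

    The encoding is, in order of preference: the `encoding` argument,
    sys.stdin.encoding, then `fallback`. Input that is already unicode
    (e.g. the result of input() on Python 3) is returned unchanged.
    """
    if not isinstance(raw, bytes):
        return raw  # already decoded, nothing to do
    chosen = encoding or getattr(sys.stdin, "encoding", None) or fallback
    return raw.decode(chosen)


# The UTF-8 bytes for "łą" decode to a two-character unicode string.
decoded = decode_input(b"\xc5\x82\xc4\x85", encoding="utf-8")
```

With this in place, the Python 2 call site would read print ppn(decode_input(raw_input())).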

Or, even better, switch to Python 3:

# -*- coding: utf-8 -*-

import sys

assert sys.version_info.major > 2


def ppn(word):
    word = list(word)
    cipher = dict()
    i = 1
    e = 0

    for letter in word:
        if letter not in cipher:
            cipher[letter] = i
            e += i
            i += 1
        else:
            e += cipher[letter]
    return ''.join(word) + ": " + str(e)

if __name__ == "__main__":
    print(ppn(str(input())))

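In Python 3, strings are unicode by default, so no decoding is needed.

【Discussion】:

【Solution 3】:

All the answers so far say what to do but not what is going on, so here are some pointers.

When you use raw_input() in Python 2, it returns a byte string (input() on Python 3 behaves differently and returns unicode text). Most unicode characters cannot be represented as a single byte, because there are far more unicode characters than the 256 values a byte can hold.

Characters such as ł or ą can take up two or more bytes when encoded with utf-8 or another encoding: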

>>> 'ł'
'\xc5\x82'
>>> 'ą'
'\xc4\x85'

Your original program treated those two bytes as two separate characters, which produced the incorrect result.
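To make the miscount concrete, here is a small sketch that applies the same first-occurrence rule once to the raw UTF-8 bytes and once to the decoded text. The function name score and the sample word "łąka" are illustrative choices, not taken from the question, and the code avoids version-specific constructs so that it should behave the same on Python 2.7 and Python 3.

```python
def score(units):
    """Apply the first-occurrence rule to any sequence of hashable units."""
    cipher = {}
    total = 0
    for unit in units:
        if unit not in cipher:
            cipher[unit] = len(cipher) + 1  # next unused value: 1, 2, 3, ...
        total += cipher[unit]
    return total


text = u"\u0142\u0105ka"      # the word "łąka": 4 letters
raw = text.encode("utf-8")    # 6 bytes, because ł and ą take two bytes each

# Split the byte string into one-byte pieces, which is what iterating
# over a Python 2 str does implicitly.
byte_units = [raw[i:i + 1] for i in range(len(raw))]

byte_score = score(byte_units)  # scores 6 byte-sized units
char_score = score(text)        # scores 4 letters
```

Here the byte-wise pass sees six distinct units and yields 1+2+3+4+5+6 = 21, while the decoded text has four distinct letters and yields 1+2+3+4 = 10.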

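Python offers an alternative to byte strings: unicode strings. In a unicode string, each character is exactly one element (the string's internal representation is opaque), so the problem you ran into does not occur.

So decoding the byte string into a unicode string is the way to go.

【Discussion】:

That was my guess too, but I wasn't sure how to avoid it in Python. Thanks.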