How to work with German umlaut characters in Python

【Question title】: How to work with German umlaut characters in Python
【Posted】: 2017-10-18 19:45:07
【Question】:

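I have a text file containing German phrases, and I'm trying to remove the non-letter characters without removing the umlaut characters. I've seen other similar questions, but none of the solutions seem to work for me. In some cases Python seems to treat an umlaut character as two characters, yet the print function works fine: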

>>> ch = '\xc3\xbc'
>>> print(ch)
ü
>>> print(len(ch))
2
>>> print(list(ch))
['\xc3', '\xbc']

My code for removing non-letter characters is:

import unicodedata
def strip_po(s):
    ''.join(x for x in s if unicodedata.category(x) != 'Po')
word = strip_po(word)

Traceback (most recent call last):
File "/home/ed/Desktop/Deutsch/test", line 17, in <module>
  word = strip_po(word)
File "/home/ed/Desktop/Deutsch/test", line 9, in strip_po
  ''.join(x for x in s if unicodedata.category(x) != 'Po')
File "/home/ed/Desktop/Deutsch/test", line 9, in <genexpr>
  ''.join(x for x in s if unicodedata.category(x) != 'Po')
TypeError: category() argument 1 must be unicode, not str

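【Comments】:

Where are you getting the string from?
Are you sure you are running this on Python 3?
The "unicode, not str" part of the error message suggests you are using Python 2, but you have tagged this question as Python 3. Which one are you actually using? Unicode handling differs significantly between the two versions.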

Tags: python utf-8 python-3.6

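【Answer 1】:

I'm going to assume you are using Python 2 here, since I can reproduce your problem with Py2.

You don't want to do any text processing on bytes. The Python 2 str type is really just a sequence of bytes, which is why len reports 2 for your character: it is two bytes long in UTF-8. You want to convert those bytes to the unicode type. You can do that like this: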

In [1]: '\xc3\xbc'.decode('utf8')
Out[1]: u'\xfc'

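Note that running len on the result above yields 1, since it is now a single unicode character. You can now process your text normally, and unicodedata.category(u'\xfc') reports that this character is in the 'Ll' (lowercase letter) category.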
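As a rough illustration of the point above, the following sketch compares the byte length and the character length of "ü" and looks up its category. The variable names are made up for this example, and the snippet is written so that it should behave the same on Python 2.7 and Python 3.

```python
import unicodedata

raw = b'\xc3\xbc'            # the two UTF-8 bytes that encode "ü"
text = raw.decode('utf-8')   # bytes -> text (unicode on Python 2, str on Python 3)

byte_length = len(raw)       # counts bytes
char_length = len(text)      # counts characters
category = unicodedata.category(text)  # general category of the single character

print(byte_length, char_length, category)
```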

You probably want to strip more categories than just Po. There is a full list here: https://en.wikipedia.org/wiki/Unicode_character_property
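One possible way to cover more categories, sketched here under the assumption that you want to drop every kind of punctuation: all punctuation categories (Po, Pd, Ps, Pe, Pi, Pf, Pc) start with the letter "P", so a prefix check catches them all. The function name strip_punctuation and the sample phrase are invented for this example. Note the explicit return; the function in the question builds the string but never returns it.

```python
import unicodedata

def strip_punctuation(text):
    """Return text without characters whose Unicode category starts with 'P'."""
    return u''.join(
        ch for ch in text
        if not unicodedata.category(ch).startswith('P')
    )

# Sample input is already text (unicode), not bytes.
cleaned = strip_punctuation(u'Gr\xfc\xdf Gott, wie geht\u2019s?')
print(cleaned)
```

Umlauts and ß are letters (categories Ll and Lu), so they pass through untouched.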

Python's built-in isalpha method may help you here, but you need the value to be of type unicode first, as shown above. https://docs.python.org/2/library/stdtypes.html#str.isalpha

In [2]: u'\xfc'.isalpha()
Out[2]: True
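Building on isalpha, a minimal sketch of a filter that keeps only letters (and, optionally, whitespace) might look like the following. The function keep_letters and its keep_spaces parameter are hypothetical names for this example, and the input is assumed to be decoded text already.

```python
def keep_letters(text, keep_spaces=True):
    """Keep alphabetic characters, plus whitespace when keep_spaces is true."""
    return u''.join(
        ch for ch in text
        if ch.isalpha() or (keep_spaces and ch.isspace())
    )

with_spaces = keep_letters(u'S\xfc\xdfe Gr\xfc\xdfe!')
without_spaces = keep_letters(u'S\xfc\xdfe Gr\xfc\xdfe!', keep_spaces=False)
print(with_spaces)
print(without_spaces)
```

Unlike the category check, this also removes digits and symbols, which may or may not be what you want.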

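【Comments】:

Yes, it turns out I was accidentally using Python 2. I write my code in the Atom text editor and use a package that lets me run it by pressing F5. I thought it was running the code with Python 3, but it was using Python 2, so that's my answer. Thanks.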
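For anyone hitting the same mix-up, a quick way to see which interpreter an editor plugin actually launches is to print the version from inside the script. This is only a sketch; describe_interpreter is a made-up helper name.

```python
import sys

def describe_interpreter():
    """Return a short description of the interpreter running this script."""
    major, minor = sys.version_info[:2]
    return 'Python %d.%d (%s)' % (major, minor, sys.executable)

print(describe_interpreter())

# On Python 2, plain string literals are byte strings, which is what
# triggers the "must be unicode, not str" error in the question.
if sys.version_info[0] < 3:
    sys.stderr.write('Warning: running under Python 2\n')
```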