How can I make this Python2.6 function work with Unicode? (Answers)

【Question title】: How can I make this Python2.6 function work with Unicode?
【Posted】: 2011-04-15 15:58:30
【Question description】:

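I have this function, which I adapted from material in chapter 1 of the online NLTK book. It has been very useful to me, but despite reading the chapter on Unicode, I am just as lost as before.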

def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    tokens = nltk.wordpunct_tokenize(rawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab

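When I tried it the other day on Also Sprach Zarathustra, it clobbered words with an umlaut over the o and u. I'm sure some of you will know why that happened. I'm also sure it's easy to fix. I know it just has to do with calling a function that re-encodes the tokens into unicode strings. If so, it seems to me it might not happen in that function definition at all, but here, where I prepare to write to a file: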

def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = '\n'.join(jotted)
    filemydata.write(jottedf)
    filemydata.close()
    return 0

I heard that I have to convert the string to unicode after reading it from the file. I tried amending the function like this:

def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    unirawness = rawness.decode('utf-8')
    tokens = nltk.wordpunct_tokenize(unirawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab

But that produced this error when I used it on Hungarian. When I used it on German, I got no errors.

>>> import bookroutines
>>> elles1 = bookroutines.openbookreturnvocab("lk1-les1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bookroutines.py", line 9, in openbookreturnvocab
    nltktext = nltk.Text(tokens)
  File "/usr/lib/pymodules/python2.6/nltk/text.py", line 285, in __init__
    self.name = " ".join(map(str, tokens[:8])) + "..."
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 4: ordinal not in range(128)

I fixed the function that files the data like this:

def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = u'\n'.join(jotted)
    filemydata.write(jottedf)
    filemydata.close()
    return 0

However, when I tried to file the German, that produced this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bookroutines.py", line 23, in jotindex
    filemydata.write(jottedf)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 414: ordinal not in range(128)
>>>

...which is what you get when you try to write the u'\n'.join'ed data.

>>> jottedf = u'/n'.join(elles1)
>>> filemydata.write(jottedf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 504: ordinal not in range(128)

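【Question discussion】:

"It clobbered"? Really? What could that possibly mean? Can you provide examples?
That's "clöbbëred", not "clobbered". If you are using Python 2.x, then almost certainly, in the first example, you are dealing with octets (bytes), not Unicode objects; to get Unicode text from reading a file, you can e.g. do open( route ).read().decode( 'utf-8' ) (or whatever encoding that file uses). That said, your variable names are baffling. filemydata tsts. jotted??
That's what I meant. Sorry for not including examples, Mr. Lott. As for those variable names, I have to keep them long so that I can type them quickly. (Because sometimes I'm tempted to use something like var31tud3, or even a single letter of the alphabet, and neither stays in memory well enough to roll off my fingers correctly.)
@flow I tried writing it the way you mentioned, and it returned the same error I mentioned above, the one that occurred when I used Ivo's suggestion.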

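Tags: python unicode nlp python-2.6 nltk

【Solution 1】:

For every string you read from a file, if the text is in UTF-8, you can convert it to unicode by calling rawness.decode('utf-8'). You end up with unicode objects. Also, I don't know what "jotted" is, but you may want to make sure it holds unicode objects and use u'\n'.join(jotted) instead.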
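
As a minimal sketch of that decode step (the file name `sample_utf8.txt` and the helper `read_unicode` are made up for illustration), the bytes can be decoded by hand, or `io.open`, which is available from Python 2.6 onward, can decode while reading:

```python
import io
import os
import tempfile


def read_unicode(path, encoding='utf-8'):
    """Return the whole file as unicode text (hypothetical helper)."""
    # io.open decodes while reading, so no separate .decode() call is needed.
    with io.open(path, 'r', encoding=encoding) as handle:
        return handle.read()


# Build a small UTF-8 file to read back (illustrative data only).
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_utf8.txt')
with io.open(sample_path, 'wb') as handle:
    handle.write(u'sch\xf6n \xfcber'.encode('utf-8'))

# Manual route: read the raw bytes, then decode them.
with io.open(sample_path, 'rb') as handle:
    decoded_by_hand = handle.read().decode('utf-8')

# Same result, with the decoding done by io.open.
decoded_by_io = read_unicode(sample_path)
```

Both routes should give the same unicode text; the second just keeps the encoding choice in one place.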

Update:

It seems the NLTK library doesn't like unicode objects. OK, then you have to make sure you are using str instances holding UTF-8 encoded text. Try this:

tokens = nltk.wordpunct_tokenize(unirawness)
nltktext = nltk.Text([token.encode('utf-8') for token in tokens])
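
The traceback in the question points at a `str` call inside `nltk/text.py`. In Python 2, `str` applied to a unicode object encodes it with the default codec, which is ASCII unless it has been changed. The sketch below reproduces that with an explicit `.encode('ascii')` so that it behaves the same on Python 2 and 3; the sample token is made up:

```python
token = u'h\xe1z'  # made-up sample token containing a-acute (U+00E1)

try:
    # Roughly what str(token) attempts implicitly in Python 2.
    token.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding to UTF-8 first yields a byte string, so no ASCII step is involved.
utf8_token = token.encode('utf-8')
```

That is why handing already-encoded tokens to nltk.Text avoids the error.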

And this:

jottedf = u'\n'.join(jotted)
filemydata.write(jottedf.encode('utf-8'))

But if jotted really is a list of UTF-8 encoded str, then you don't need this, and the following should be enough:

jottedf = '\n'.join(jotted)
filemydata.write(jottedf)
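
Either way, the text has to be encoded explicitly before it reaches the file. Here is a minimal sketch of a writer that takes unicode strings, assuming UTF-8 output is wanted; `write_lines` and the file name are hypothetical, and `io.open` needs Python 2.6 or later:

```python
import io
import os
import tempfile


def write_lines(lines, filename, mode='w', encoding='utf-8'):
    """Join unicode strings with newlines and write them encoded (hypothetical helper)."""
    text = u'\n'.join(lines)
    # io.open encodes on write, so the ASCII default never comes into play.
    with io.open(filename, mode, encoding=encoding) as handle:
        handle.write(text)


out_path = os.path.join(tempfile.mkdtemp(), 'vocab.txt')
write_lines([u'sch\xf6n', u'\xfcber'], out_path)

# Read the raw bytes back to confirm what was stored.
with io.open(out_path, 'rb') as handle:
    stored_bytes = handle.read()
```

This version expects unicode strings; a list of UTF-8 encoded str would need decoding first, or the plain '\n'.join shown above.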

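By the way, NLTK doesn't seem very careful about unicode and encodings (at least in the demos). Better be careful and check that it handles your tokens correctly. Also, check your encodings: a mismatch may be why you get errors with the Hungarian text but not with the German text.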
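
One way to run that check is to attempt a strict decode with each candidate encoding and see which one succeeds. The sketch below is only a heuristic: the candidate list (UTF-8 first, then ISO-8859-2, a legacy encoding often used for Hungarian) is an assumption, and `guess_decode` is a hypothetical helper. A single-byte encoding such as ISO-8859-2 accepts almost any byte sequence, so it should come last.

```python
def guess_decode(raw_bytes, candidates=('utf-8', 'iso-8859-2')):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings matched')


utf8_bytes = u'k\xf6nyv'.encode('utf-8')         # valid UTF-8
latin2_bytes = u'k\xf6nyv'.encode('iso-8859-2')  # same word, legacy bytes

utf8_result = guess_decode(utf8_bytes)
latin2_result = guess_decode(latin2_bytes)
```

A clean decode does not prove the guess is right, so it is still worth eyeballing a few accented words afterwards.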

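【Discussion】:

I added that bit and got an error. I've edited the question to show the error and the modification.
Wow. initiate euphoria... they're all UTF-8... Yes, the .join function was right, but the openbook function needed both of your fixes... Thanks!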