UTF-8 error with Python and gettext

【Question title】: UTF-8 error with Python and gettext
【Posted】: 2011-04-04 22:49:48
【Question】:

I use UTF-8 in my editor, so all strings shown here are UTF-8 in the file.

I have a Python script like this:

# -*- coding: utf-8 -*-
...
parser = optparse.OptionParser(
  description=_('automates the dice rolling in the classic game "risk"'), 
  usage=_("usage: %prog attacking defending"))

Then I extracted all the strings with xgettext and got a .pot file, which boils down to:

"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: 8bit\n"

#: auto_dice.py:16
msgid "automates the dice rolling in the classic game \"risk\""
msgstr ""

After that, I used msginit to get a de.po, which I filled in like this:

"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

#: auto_dice.py:16
msgid "automates the dice rolling in the classic game \"risk\""
msgstr "automatisiert das Würfeln bei \"Risiko\""

Running the script produces the following error:

  File "/usr/lib/python2.6/optparse.py", line 1664, in print_help
    file.write(self.format_help().encode(encoding, "replace"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 60: ordinal not in range(128)

How can I fix this?

【Comments】:

What is the type of _("usage: %prog attacking defending")? That is, what does print type(_("usage: %prog attacking defending")) print?

Tags: python localization gettext

【Solution 1】:

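The error means that encode was called on a byte string. Python 2 therefore first tries to decode it to Unicode using the system default encoding (ascii on Python 2), and only then re-encodes it with whatever encoding you specified. The implicit decode is the step that fails.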
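The two steps can be spelled out explicitly. The following is a minimal sketch written for Python 3, not the Python 2.6 of the question; `py2_style_encode` is a hypothetical helper that only imitates what Python 2 does implicitly, and the byte string stands in for what `_()` returned in the question.

```python
# Python 3 sketch: make Python 2's hidden decode step visible.
# UTF-8 bytes of the German msgstr; \u00fc is the u-umlaut.
translated = 'automatisiert das W\u00fcrfeln bei "Risiko"'.encode("utf-8")

def py2_style_encode(byte_string, encoding):
    """Hypothetical helper imitating str.encode() on a Python 2 byte string."""
    # Step 1: the implicit decode with the default codec (ascii on Python 2).
    text = byte_string.decode("ascii")
    # Step 2: the encode that the caller actually asked for.
    return text.encode(encoding, "replace")

try:
    py2_style_encode(translated, "utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The exception is raised in step 1, before the requested encoding is ever used, which is why the traceback names the ascii codec even though the call asked for something else.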

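Generally, the fix is to call s.decode('utf-8') (or whatever encoding the string is actually in) before you use the string. Using only unicode literals might also work: u'automates...' (that depends on how the strings from the .po file are substituted, which I don't know).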
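A small guard can make that decode safe whether the lookup returns bytes or text. This is only a sketch: `to_text` is a hypothetical helper name, UTF-8 is assumed to be the catalog encoding, and the byte string stands in for a value returned by `_()`. It is written for Python 3.

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode byte strings, pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Stand-in for a byte string coming back from the translation lookup.
from_catalog = b'automatisiert das W\xc3\xbcrfeln bei "Risiko"'

description = to_text(from_catalog)
# Encoding now starts from real text, so no implicit ascii decode is involved.
encoded = description.encode("ascii", "replace")
print(encoded)
```

Calling `to_text` on a value that is already text is a no-op, so the same call would keep working if the lookup later starts returning Unicode.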

This confusing behaviour was improved in Python 3, which does not try to convert bytes to unicode unless you explicitly tell it to.
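For comparison, here is a short Python 3 sketch of that stricter behaviour; the variable names are illustrative only.

```python
raw = b"W\xc3\xbcrfeln"  # UTF-8 bytes of "W\u00fcrfeln"

# bytes objects have no .encode() in Python 3, so the call from the
# traceback cannot be made on a byte string by accident.
bytes_can_encode = hasattr(raw, "encode")

# Mixing text and bytes fails right away instead of triggering a hidden decode.
try:
    "usage: " + raw
    mixing_error = None
except TypeError as exc:
    mixing_error = exc

# Text is only obtained through an explicit decode.
text = raw.decode("utf-8")
print(bytes_can_encode, type(mixing_error).__name__, len(text))
```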

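【Discussion】:

u"literal" doesn't work, but decode("utf-8") does. Not great, but it works.

【Solution 2】:

I suspect the problem is caused by _("string") returning a byte string rather than a Unicode string.

The obvious workaround is: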

parser = optparse.OptionParser(
        description=_('automates the dice rolling in the classic game "risk"').decode('utf-8'),
        usage=_("usage: %prog attacking defending").decode('utf-8'))

But that feels wrong.

ugettext or install(True) may help.

The Python gettext docs give these examples:

import gettext
t = gettext.translation('spam', '/usr/share/locale')
_ = t.ugettext

Or:

import gettext
gettext.install('myapplication', '/usr/share/locale', unicode=1)

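I am trying to reproduce your problem, and even when I use install(unicode=1) I get back a byte string (type str).

Either I am using gettext incorrectly, or my .po/.mo file is missing a character encoding declaration.

I will update this answer when I know more. In the meantime, this is the check I am running: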

xlt = _('automates the dice rolling in the classic game "risk"')
print type(xlt)
if isinstance(xlt, str):
    print 'gettext returned a str (wrong)'
    print xlt
    print xlt.decode('utf-8').encode('utf-8')
elif isinstance(xlt, unicode):
    print 'gettext returned a unicode (right)'
    print xlt.encode('utf-8')

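(Another possibility is to use escapes or Unicode code points in the .po file, but that doesn't sound like fun.)

(Or you could look at your system's .po files to see how they handle non-ASCII characters.)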
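To check how the charset declaration in the catalog header feeds into lookups, a catalog can be built in memory and loaded with gettext.GNUTranslations. The sketch below is written for Python 3, not the 2.6 of the question. `build_mo` is a hypothetical helper that packs a minimal little-endian .mo file (no plural forms, no hash table) and is my assumption of the smallest layout Python's parser accepts; a real project would compile de.po with msgfmt instead.

```python
import gettext
import io
import struct

def build_mo(messages):
    """Hypothetical helper: pack {msgid: msgstr} into minimal GNU .mo bytes."""
    keys = sorted(messages)  # "" (the header entry) sorts first
    ids = b""
    strs = b""
    spans = []
    for key in keys:
        msgid = key.encode("utf-8")
        msgstr = messages[key].encode("utf-8")
        spans.append((len(ids), len(msgid), len(strs), len(msgstr)))
        ids += msgid + b"\0"    # every string is NUL-terminated
        strs += msgstr + b"\0"
    header_size = 7 * 4              # seven 32-bit header fields
    table_size = len(keys) * 8       # (length, offset) pair per string
    ids_start = header_size + 2 * table_size
    strs_start = ids_start + len(ids)
    # magic, version, count, msgid table offset, msgstr table offset, hash size, hash offset
    out = struct.pack("<7I", 0x950412DE, 0, len(keys),
                      header_size, header_size + table_size, 0, 0)
    for id_off, id_len, _, _ in spans:
        out += struct.pack("<2I", id_len, ids_start + id_off)
    for _, _, str_off, str_len in spans:
        out += struct.pack("<2I", str_len, strs_start + str_off)
    return out + ids + strs

msgid = 'automates the dice rolling in the classic game "risk"'
catalog = {
    # The empty msgid carries the header, including the charset declaration.
    "": "Content-Type: text/plain; charset=UTF-8\n"
        "Content-Transfer-Encoding: 8bit\n",
    msgid: 'automatisiert das W\u00fcrfeln bei "Risiko"',
}

t = gettext.GNUTranslations(io.BytesIO(build_mo(catalog)))
translated = t.gettext(msgid)
print(type(translated).__name__)
```

In Python 3 the lookup returns text decoded with the charset named in the header, so no separate ugettext or unicode=1 switch is involved there.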

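【Discussion】:

Using dk = gettext.translation(....., languages=['dk']) and then dk.install(unicode=True) solved this for me. This solution also supports more than one language.

【Solution 3】:

I'm not familiar with this, but it appears to be a known bug in 2.6 that was fixed in 2.7: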

http://bugs.python.org/issue2931

If you can't use 2.7, try the following workaround:

http://mail.python.org/pipermail/python-dev/2006-May/065458.html

【Discussion】: