Which encoding to use for reading Italian text in Python? Answers

【Question title】: Which encoding to use for reading Italian text in Python?
【Posted】: 2013-04-04 23:35:38
【Question description】:

I am using Python Tools for Visual Studio and reading some files written in Italian. I have tried iso-8859-1, iso-8859-2, utf-8, and utf-8-sig. Notepad++ opens the files as UTF-8 without BOM.

content = fp.read()
words = content.decode("utf-8-sig").lower().split()
for w in words:
    p=''
    cur.execute('SELECT word FROM  multiwordnet.italian_lemma l, multiwordnet.italian_synset s where l.id = s.id and l.lemma="%s"' % w)

The string that causes the crash is C'è. (It is read as "c\'\xe3\xa8".)

Using chardet did not help.

Traceback (most recent call last):
File "C:\Users\Tathagata\Documents\Visual Studio 2012\Projects\PythonApplicati
on4\PythonApplication4\PythonApplication4.py", line 344, in <module>
createSynsetDict()
File "C:\Users\Tathagata\Documents\Visual Studio 2012\Projects\PythonApplicati
on4\PythonApplication4\PythonApplication4.py", line 294, in createSynsetDict
cur.execute('SELECT word FROM  multiwordnet.italian_lemma l, multiwordnet.it
alian_synset s where l.id = s.id and l.lemma="%s"' % w)
File "C:\Python27\lib\site-packages\pymysql\cursors.py", line 117, in execute
self.errorhandler(self, exc, value)
File "C:\Python27\lib\site-packages\pymysql\connections.py", line 187, in defa
ulterrorhandler
raise Error(errorclass, errorvalue)
Error: (<type 'exceptions.UnicodeEncodeError'>, UnicodeEncodeError('ascii', u's\
x00\x00\x00\x03SELECT word FROM  multiwordnet.italian_lemma l, multiwordnet.ital 
ian_synset s where l.id = s.id and l.lemma="c\'\xe3\xa8"', 116, 118, 'ordinal no
t in range(128)'))
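
The escaped value in the traceback, c\'\xe3\xa8, is consistent with UTF-8 bytes being decoded as iso-8859-1 and then lowercased. The following is a minimal sketch in Python 3 (the post itself uses Python 2.7), with the variable names chosen only for illustration. It shows how the two bytes of è turn into two separate characters under the wrong codec, and into a single character under utf-8-sig.

```python
# Python 3 sketch; variable names are illustrative, not from the original post.
original = "C'è"

# In UTF-8, "è" is stored as the two bytes 0xC3 0xA8.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes as iso-8859-1 yields "Ã" and "¨"; lower() then maps "Ã" to "ã".
wrong = utf8_bytes.decode("iso-8859-1").lower()

# Decoding with utf-8-sig (BOM optional) recovers the single character "è".
right = utf8_bytes.decode("utf-8-sig").lower()

print(ascii(wrong))
print(ascii(right))
```

If this is what happened, the traceback above came from a run that used iso-8859-1 rather than utf-8-sig.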

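【Comments】:

How Do I Stop The Pain?
Which DB-API binding are you using? (That is, which database driver?)
...Actually, more to the point, what is the value of the paramstyle global in your database library module? (If you don't know, just identify the module and we can look it up.)
@CharlesDuffy See the full code and more in the comments (gist.github.com/tathagata/5320310)

Tags: python visual-studio-2010 encoding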

【Solution 1】:

Assuming your database's bind-variable style is format...

content = fp.read()
words = content.decode("utf-8-sig").lower().split()
for w in words:
    p=''
    # Pass the value as a one-element tuple; the driver quotes and encodes it.
    cur.execute('SELECT word FROM '
                'multiwordnet.italian_lemma l, '
                'multiwordnet.italian_synset s '
                'where l.id = s.id and l.lemma=%s', (w,))

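Note that we do not use the % operator between the SQL string and the variable being passed in, nor do we put literal quotes around %s. Instead, %s is a placeholder that marks where the word should be substituted into the SQL, and the value to substitute for it is passed as a separate argument. Following this practice not only spares you from dealing with the encoding issue (if your parameter is passed as a Python Unicode string, the database binding is responsible for taking it from there), it also prevents SQL injection vulnerabilities.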
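
As a rough illustration of that point, here is a self-contained Python 3 sketch. It uses the standard-library sqlite3 module in place of PyMySQL, so the placeholder is ? rather than %s. The table layout, the sample rows, and the lookup helper are all hypothetical. It shows an accented lemma and a hostile string both being bound as plain values.

```python
import sqlite3

# In-memory SQLite database; the table and its rows are hypothetical stand-ins.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE italian_lemma (lemma TEXT, word TEXT)")
cur.executemany(
    "INSERT INTO italian_lemma (lemma, word) VALUES (?, ?)",
    [("c'è", "esserci"), ("casa", "casa")],
)

def lookup(cursor, lemma):
    """Return the words stored for a lemma, binding the value as a parameter."""
    # sqlite3 uses the qmark style; with a format-style driver this would be %s.
    cursor.execute("SELECT word FROM italian_lemma WHERE lemma = ?", (lemma,))
    return [row[0] for row in cursor.fetchall()]

# The apostrophe and the accented character need no manual quoting or encoding.
accented = lookup(cur, "c'è")

# A hostile string is compared as a literal value, so it matches nothing.
hostile = lookup(cur, "x' OR '1'='1")

print(accented)
print(hostile)
```

Because the value never becomes part of the SQL text, neither the quote in c'è nor the injected OR clause changes the query.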

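Other Python database libraries may use a different placeholder style; read the documentation or check your module-level paramstyle constant. (For qmark, your placeholder should be ?; for numeric, it should be a colon-prefixed number: :1 for the first argument, :2 for the second, and so on.)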
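
One way to check this at runtime is to read the paramstyle attribute that PEP 249 requires every DB-API module to expose. The sketch below is Python 3 and uses sqlite3 only because it ships with the standard library. The PLACEHOLDERS mapping, the placeholder_for helper, and the parameter name lemma are assumptions made for illustration.

```python
import sqlite3

# Placeholder text for the first parameter under each PEP 249 paramstyle.
# The parameter name "lemma" is a hypothetical choice for the named styles.
PLACEHOLDERS = {
    "qmark": "?",
    "numeric": ":1",
    "named": ":lemma",
    "format": "%s",
    "pyformat": "%(lemma)s",
}

def placeholder_for(module):
    """Look up the placeholder syntax for a DB-API module via its paramstyle."""
    return PLACEHOLDERS[module.paramstyle]

# sqlite3 reports the qmark style, so its placeholder is "?".
print(sqlite3.paramstyle, placeholder_for(sqlite3))
```

Passing a different driver module to placeholder_for should give the syntax for that driver, as long as it follows PEP 249.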

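【Discussion】:

Thank you very much for the reply. I am using PyMySQL [github.com/petehunt/PyMySQL/], which has paramstyle=format, which is why the code kept working until it reached a word with funny characters. If I use ? as you suggested, it raises a KeyError [pastebin.com/V7T6xbkY], even for words that were readable using %s.
@Tathagata Yes, for format you should use %s rather than ?, but still with a comma instead of the % operator. I am updating the answer accordingly.
@Tathagata ...by the way, please avoid pastebin.com links in the future; for people not using Adblock, the site is full of flashy animated ads.
On pastebin: wow, I had no idea. It is such a different view of the internet that you get without an ad blocker!
Although your answer was insightful, it did not solve the problem. I worked around it by throwing it into a try/catch. Thank you very much for the time and effort you put in, thanks again!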