How do I get around unsupported characters while web scraping? (Answers)

【Question title】: How do I get around unsupported characters while web scraping?
【Posted】: 2012-03-09 07:54:14
【Question】:

I am scraping a web page with lxml. At one point, I read the contents of a table cell.

# get last name
lastNameContainer = tableRow.xpath('./td[@class="lastName"]')
lastName = lastNameContainer[0].text

Unfortunately, one table cell contains characters outside the ASCII range, which produces this error.

UnicodeEncodeError: 'ascii' codec can't encode characters in position 5-7: ordinal not in range(128)

I tried adding this to the top of my Python file, but it did not help.

#!/usr/bin/python
# -*- coding: utf-8 -*-

How can I fix this? I still want to store the character. By the way, the character is either ♀ or ♂, depending on the table row.

Update: I realized the error is triggered when I write the data to a file:

with open('myData.txt', 'w') as myFile:
    myFile.write(lastName + '\n')

Strangely, this still produces the error above.

with open('myData.txt', 'w') as myFile:
    myFile.write(lastName.decode('utf-8') + '\n')

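【Comments on the question】:

Is the file you are writing to encoded in ASCII? When I write the ♀ character to a file, I don't run into this problem (on Ubuntu, I write in UTF-8 by default).
By the way, # -*- coding: utf-8 -*- only declares the encoding of the source file itself (yourmodule.py); see python.org/dev/peps/pep-0263
Thanks. How do I change the encoding used when writing the file?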

【Solution 1】:

lxml expects its strings in unicode. When I get that exception, I fix it with decode('utf-8').

For example: E.doc('♀'.decode('utf-8'))
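
As a minimal sketch of the distinction this answer relies on (not lxml-specific, and written so that it runs on both Python 2 and Python 3): decoding turns UTF-8 bytes into a unicode string, and encoding that string with the ascii codec fails for the same reason as the traceback in the question. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
raw = b'\xe2\x99\x80'              # the UTF-8 bytes for the female sign
text = raw.decode('utf-8')         # bytes -> unicode text

try:
    text.encode('ascii')           # what an implicit ASCII conversion attempts
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False               # the character is outside the ASCII range

utf8_bytes = text.encode('utf-8')  # unicode text -> UTF-8 bytes, safe to write
```

The takeaway is to call decode only on byte strings and encode only on unicode strings.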

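Update:

with open('myData.txt', 'w') as myFile:
    myFile.write(lastName + '\n')

Strangely, this still produces the error above.

with open('myData.txt', 'w') as myFile:
    myFile.write(lastName.decode('utf-8') + '\n')

Also note that if lastName is already unicode and you are trying to write a UTF-8 encoded file, you need to convert it back to bytes with lastName.encode('utf-8'). Calling decode on a unicode object makes Python 2 first encode it with the ascii codec, which raises the same UnicodeEncodeError.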

with open('myData.txt', 'w') as myFile:
    myFile.write(lastName.encode('utf-8') + '\n')
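
An alternative to encoding by hand, which also addresses the comment asking how to change the file's encoding, is to open the file with an explicit encoding and write unicode strings directly. The following is only a sketch: write_names and the sample names are hypothetical, and io.open is used because it accepts an encoding argument on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_names(path, names):
    """Write each unicode name on its own line, encoded as UTF-8."""
    with io.open(path, 'w', encoding='utf-8') as out:
        for name in names:
            out.write(name + u'\n')


# Hypothetical sample data standing in for the scraped table cells.
names = [u'Smith \u2640', u'Jones \u2642']
path = os.path.join(tempfile.mkdtemp(), 'myData.txt')
write_names(path, names)

# Read the file back with the same encoding to check the round trip.
with io.open(path, 'r', encoding='utf-8') as src:
    round_tripped = src.read().splitlines()
```

With this approach the file object performs the encoding, so the strings passed to write must be unicode, not already-encoded bytes.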

【Comments】:

Also, I think my error is triggered when writing to the file.
E is an instance of ElementMaker (a shortcut for building elements): lxml.de/api/lxml.builder.ElementMaker-class.html