Beautiful Soup Type Error and Regex: Answers

【Question Title】: Beautiful Soup Type Error and Regex
【Posted】: 2013-04-09 15:28:30
【Question】:

I am trying to find all the email addresses on a given page and match them with a regex. I am using BeautifulSoup to get all the <a> tags:

email_re = re.compile('[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*')

email = soup.findAll("a")
for j in email:
    email = j.string
    for match in email_re.findall(email):
        outfile.write(match + "\n")
        print match

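However, when I run my script, this part raises TypeError: expected string or buffer. I think this is because email is a BeautifulSoup object rather than a Python string. I tried converting it to a string with str() or __str__(), and both give another error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 9: ordinal not in range(128). What can I do to resolve these errors and actually get my script to run? I'm out of ideas. Please help!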

【Comments】:

Which Python version are you using? 2.* or 3.*?
I'm using Python 2.7.
Which line triggers the error? outfile.write(match + "\n")?

Tags: python regex beautifulsoup

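【Solution 1】:

The match variable most likely has type unicode. To write it to a file, it has to be encoded with some encoding. By default, Python 2 tries to encode it as ASCII, which fails on characters such as u'\u2019'. Try the following: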

outfile.write(match.encode('utf-8') + "\n")

You may also want to change UTF-8 to whatever encoding your outfile is supposed to have.
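
As an alternative to calling .encode() on every write, the file itself can be opened with an encoding. The following is only a minimal sketch: write_matches and its parameters are hypothetical names that do not appear in the question, and it assumes the matches are unicode text (as they would be when they come from BeautifulSoup strings).

```python
import io


def write_matches(path, matches, encoding="utf-8"):
    """Write each match on its own line, encoding the text on the way out.

    Hypothetical helper: assumes every item in `matches` is unicode text.
    """
    # io.open takes an `encoding` argument on both Python 2.7 and Python 3,
    # so the text is encoded by the file object rather than via the ASCII default.
    with io.open(path, "w", encoding=encoding) as outfile:
        for match in matches:
            outfile.write(match + u"\n")
```

With this approach the loop body stays free of explicit .encode() calls, and changing the output encoding means changing one argument.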

There is also a nice Unicode HOWTO for Python 2.x. Note, however, that Python 3 handles Unicode in a different, more logical way.
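
The answer above deals with the UnicodeEncodeError. For the TypeError in the question, one possible cause (an assumption, since the page being parsed is not shown) is that j.string is None for some tags: BeautifulSoup returns None from .string when a tag does not contain exactly one string, and passing None to findall raises "expected string or buffer" on Python 2. A rough sketch of a guard, where find_emails is a hypothetical helper and not part of the original script:

```python
import re

# Same pattern as in the question, written as a raw string.
email_re = re.compile(r'[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*')


def find_emails(texts):
    """Collect regex matches from an iterable of strings, skipping None."""
    found = []
    for text in texts:
        if text is None:
            # For example, a tag whose .string is None; nothing to search.
            continue
        found.extend(email_re.findall(text))
    return found
```

In the question's script this could be called as find_emails(j.string for j in soup.findAll("a")), and the resulting list then written to outfile.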
