【Question title】: Beautiful Soup won't let me parse a variable containing HTML
【Posted】: 2013-09-25 01:57:12
【Question】:

I'm trying to pretty-print an HTML email that I have stored in a variable, but I keep getting an error from BS4 saying it expects a string.

Here is my code:

from bs4 import BeautifulSoup
import imaplib
import email


mail = imaplib.IMAP4_SSL('imap.gmail.com')

username = raw_input('USERNAME (email):')
password = raw_input('PASSWORD: ')

try:
    mail.login(username, password)
    print "Logged in as %r !" % username
except: 
    imaplib.error
    print "Log in failed."

mail.list()
# Out: list of "folders" aka labels in gmail.
mail.select("inbox") # connect to inbox.

result, data = mail.uid('search', None, '(FROM "tiffany@e.tiffany.com")')
latest_email_uid = data[0].split()[1]
result, data = mail.uid('fetch', latest_email_uid, '(RFC822)')
raw_email = data[0][1]

email_message = email.message_from_string(raw_email)

print email_message

html = email_message
soup = BeautifulSoup(html)
print soup.prettify()

Here is the printed HTML email I'm working with: http://pastebin.com/qfAHwkdV

Here is the error I get:

Traceback (most recent call last):
  File "tiff.py", line 34, in <module>
    soup = BeautifulSoup(html)
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/bs4/__init__.py", line 169, in __init__
    self.builder.prepare_markup(markup, from_encoding))
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/bs4/builder/_htmlparser.py", line 139, in prepare_markup
    dammit = UnicodeDammit(markup, try_encodings, is_html=True)
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/bs4/dammit.py", line 203, in __init__
    self._detectEncoding(markup, is_html)
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/bs4/dammit.py", line 372, in _detectEncoding
    xml_encoding_match = xml_encoding_re.match(xml_data)
TypeError: expected string or buffer

Why can't I put the HTML into a variable and parse it with BS4?

Thanks

【Question comments】:

    Tags: python parsing email beautifulsoup


    【Solution 1】:

    According to the documentation on .message_from_string, it returns a message object, not a string. BeautifulSoup() expects a string (or buffer).

    Perhaps try soup = BeautifulSoup(str(html)) or soup = BeautifulSoup(unicode(html))
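
To make the type mismatch concrete, here is a minimal sketch. It assumes Python 3 (the question uses Python 2) and uses a made-up `raw_email` string in place of the message fetched over IMAP. It shows that `email.message_from_string` returns a `Message` object and that `str()` flattens the whole message, headers included, into a single string.

```python
import email
from email.message import Message

# Hypothetical minimal message standing in for the fetched raw_email.
raw_email = (
    "From: sender@example.com\n"
    "Subject: Hello\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<html><body><p>Hi</p></body></html>\n"
)

email_message = email.message_from_string(raw_email)

# The parser hands back a Message object, not a string.
print(type(email_message))
print(isinstance(email_message, str))

# str() serializes the entire message: headers first, then the body.
as_text = str(email_message)
print(as_text)
```

Because the converted string still carries the mail headers, an HTML parser receives more than just markup.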

    【Comments】:

    • Thanks, this worked! But it stripped almost everything out of the prettified email. Any ideas?
    • @user1887261 Sorry, I've never used the email module.
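
The thread leaves the follow-up comment unanswered. One plausible explanation (an assumption, not confirmed by the thread) is that `str()` of the message includes headers, MIME boundaries, and a body that may still be quoted-printable or base64 encoded, so the parser sees very little clean HTML. The sketch below, assuming Python 3 and a made-up multipart message, walks the parts and decodes only the `text/html` one. The helper name `extract_html` is hypothetical.

```python
import email
from email.message import Message


def extract_html(message: Message) -> str:
    """Return the decoded text/html body of a message, or "" if there is none."""
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # decode=True undoes the Content-Transfer-Encoding and returns bytes.
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""


# Hypothetical multipart message with a quoted-printable HTML part.
raw_email = """\
From: sender@example.com
Subject: Hello
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="utf-8"

Hi

--XYZ
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

<html><body><p class=3D"greeting">Hi</p></body></html>

--XYZ--
"""

html_text = extract_html(email.message_from_string(raw_email))
print(html_text)
```

The returned string could then be passed to the parser, for example `BeautifulSoup(html_text, "html.parser")`. With real IMAP data under Python 3, the fetched message is bytes, so `email.message_from_bytes` would likely be the parser to use instead of `email.message_from_string`.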