BeautifulSoup4 cannot get the printing right. Python3

【Question title】: BeautifulSoup4 cannot get the printing right. Python3
【Posted】: 2016-04-27 00:21:20
【Question description】:

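I am currently learning Python3 and scraping a site for some data. That part works fine, but when I print out the p tags I cannot get the output to look the way I expect.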

import urllib.request
from bs4 import BeautifulSoup  # the 'lxml' parser used below needs the lxml package installed

# Placeholder URL from the question; urlopen() needs the scheme (http://)
data = urllib.request.urlopen('http://www.site.com').read()
soup = BeautifulSoup(data, 'lxml')
stat = soup.find('div', {'style': 'padding-left: 10px'})
dialog = stat.findChildren('p')

childlist = []
for child in dialog:
    childtext = child.get_text()
    # have tried child.string as well (exactly the same result)
    childlist.append(childtext.encode('utf-8', 'ignore'))
    # have tried str(childtext.encode('utf-8', 'ignore')) as well

print(childlist)

Everything works, but what gets printed is "bytes":

b'This is a ptag.string'
b'\xc2\xa0 (probably &nbsp'
b'this is anotherone'

Real sample text, encoded as ascii:

b"Announcementb'Firefox users may encounter browser warnings encountering SSL SHA-1 certificates"

Note that "Announcement" is the p, and the rest is a "strong" tag inside that p.

The same sample encoded as utf-8:

b"Announcement\xc2\xa0\xe2\x80\x93\xc2\xa0b'Firefox users may encounter browser warnings encountering SSL SHA-1 "

What I would like to get:

"Announcement"
(newline / new item in list)
"Firefox users may encounter browser warnings encountering SSL SHA-1 certificates"

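As you can see, the invalid characters are stripped with "ascii", but some of them are &nbsp;, and stripping them breaks some of the line breaks. I have not figured out how to print this correctly, and the b is still there!

I really do not know how to get rid of the b and encode or decode properly. I have tried every "solution" I could find by searching.

HTML content = utf-8

The last thing I want is to change the complete data before processing it, because that would mess up my other work, and I do not think it should be necessary.

Prettify does not work.

Any suggestions?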

【Question comments】:

Tags: python beautifulsoup python-3.5 bs4

【Solution 1】:

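First of all, you are getting output of the form b'stuff' because you are calling .encode(), which returns a bytes object. If you want to print strings for reading, keep them as strings!

As a guess, I assume you want the strings from the HTML to print nicely, the way they would appear in a browser. For that you need to decode the HTML entity encoding, as described in this SO answer, which for Python 3.5 means: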
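To make the str/bytes distinction concrete, here is a minimal sketch. The sample value is taken from the question's output and merely stands in for what child.get_text() returns; no real page is fetched.

```python
# Hypothetical sample standing in for the str that child.get_text() returns.
childtext = "This is a ptag.string"

# .encode() turns the str into a bytes object; printing bytes shows the b'...' form.
encoded = childtext.encode("utf-8", "ignore")
print(encoded)                # b'This is a ptag.string'

# Keeping the value as a str prints the plain text, with no b prefix.
print(childtext)              # This is a ptag.string

# If only bytes are available, .decode() turns them back into a str.
decoded = encoded.decode("utf-8")
print(decoded == childtext)   # True
```

In the question's loop this would amount to appending childtext itself to the list instead of the encoded value.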

import html
html.unescape(childtext)

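Among other things, this converts any &nbsp; sequences in the HTML string into '\xa0' characters, which print as spaces. However, if you want lines to wrap at these characters, even though &nbsp; literally means "non-breaking space", you have to replace them with actual spaces before printing, for example with x.replace('\xa0', ' ').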
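The two steps can be combined in a small helper. This is only a sketch: clean_text is a hypothetical name, and the raw string is a made-up sample modelled on the question's "Announcement" text. Note that get_text() in BeautifulSoup generally returns text with entities already decoded, in which case html.unescape leaves it unchanged and only the '\xa0' replacement has an effect.

```python
import html


def clean_text(text):
    """Unescape HTML entities, then turn non-breaking spaces into plain spaces."""
    unescaped = html.unescape(text)            # "&nbsp;" becomes "\xa0"
    return unescaped.replace("\xa0", " ").strip()


# Made-up sample that still contains raw entities (an assumption for illustration).
raw = "Announcement&nbsp;&ndash;&nbsp;Firefox users may encounter browser warnings"
cleaned = clean_text(raw)

# ascii() keeps the output readable on terminals that cannot display the en dash.
print(ascii(cleaned))
# 'Announcement \u2013 Firefox users may encounter browser warnings'
```

The result stays a str throughout, so printing it never shows a b prefix.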

【Comments】: