How to clean tags from HTML using BeautifulSoup

【Question title】: How to Clean tags from html using BeautifulSoup
【Posted】: 2018-06-19 05:25:27
【Question】:

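I am trying to use the NLTK library to train on data, following a step-by-step process. The first step worked, but on the second step I got the following error: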

TypeError: a bytes-like object is required, not 'list'

I have tried my best to fix it, but I keep getting the same error.

Here is my code:

from bs4 import BeautifulSoup
import urllib.request
response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html,"html5lib")
text = soup.get_text(strip=True)
print(text)

This is the error I get:

C:\python\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 8 of the file E:/secure secure/chatbot-master/nltk.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "html5lib")

  markup_type=markup_type))
Traceback (most recent call last):
  File "E:/secure secure/chatbot-master/nltk.py", line 8, in <module>
    soup = BeautifulSoup(html)
  File "C:\python\lib\site-packages\bs4\__init__.py", line 228, in __init__
    self._feed()
  File "C:\python\lib\site-packages\bs4\__init__.py", line 289, in _feed
    self.builder.feed(self.markup)
  File "C:\python\lib\site-packages\bs4\builder\_html5lib.py", line 72, in feed
    doc = parser.parse(markup, **extra_kwargs)
  File "C:\python\lib\site-packages\html5lib\html5parser.py", line 236, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "C:\python\lib\site-packages\html5lib\html5parser.py", line 89, in _parse
    parser=self, **kwargs)
  File "C:\python\lib\site-packages\html5lib\tokenizer.py", line 40, in __init__
    self.stream = HTMLInputStream(stream, encoding, parseMeta, useChardet)
  File "C:\python\lib\site-packages\html5lib\inputstream.py", line 148, in HTMLInputStream
    return HTMLBinaryInputStream(source, encoding, parseMeta, chardet)
  File "C:\python\lib\site-packages\html5lib\inputstream.py", line 416, in __init__
    self.rawStream = self.openStream(source)
  File "C:\python\lib\site-packages\html5lib\inputstream.py", line 453, in openStream
    stream = BytesIO(source)
TypeError: a bytes-like object is required, not 'list'

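【Comments】:

Have you looked at this thread: stackoverflow.com/questions/16206380/… ? You could try get_text: crummy.com/software/BeautifulSoup/bs4/doc/#get-text
I ran your script and it returns the text fine. Can you post the detailed error message?
I get this error when I run it:
TypeError: a bytes-like object is required, not 'list'
The script works fine. Please edit the question and add the error message.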
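
The get_text suggestion above can be tried offline on a small inline document. This is only a minimal sketch: the sample HTML string is made up for illustration, and html.parser is chosen here simply because it ships with Python. It also shows that strip=True without a separator joins the text fragments with nothing between them, which may not be what you want for NLTK input.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used instead of a network request
html = "<html><body><h1>Title</h1><p>First <b>bold</b> paragraph.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Default: text fragments are concatenated exactly as they appear
plain = soup.get_text()

# strip=True trims each fragment; a separator keeps the words apart
spaced = soup.get_text(separator=" ", strip=True)

print(plain)
print(spaced)
```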

Tags: python python-3.x beautifulsoup

【Solution 1】:

You can do this by implementing a simple tag stripper.

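# Python 2 code (uses unicode() and the print statement); see Solution 3 for Python 3.
from bs4 import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    # An explicit parser avoids the UserWarning; html.parser does not wrap
    # fragments in <html><body>, which keeps the recursive call clean.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):  # True matches every tag
        if tag.name in invalid_tags:
            s = ""
            for c in tag.contents:
                # Strip nested tags recursively before concatenating
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)
            tag.replace_with(s)  # replace the tag with its flattened text
    return soup

html = "<p>Love, <b>Hate</b>, and <i>Hap<b>piness</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print strip_tags(html, invalid_tags)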

With the html.parser backend, the result is (the trailing "y" comes from the <u>y</u> element):

<p>Love, Hate, and Happinessy</p>
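
For comparison, Beautiful Soup 4 also offers Tag.unwrap(), which removes a tag while keeping its children in place. The following is a minimal Python 3 sketch of the same idea, not the answer author's method; it assumes html.parser so that the fragment is not wrapped in <html><body>.

```python
from bs4 import BeautifulSoup

def strip_tags(html, invalid_tags):
    """Remove the listed tags but keep their text and child nodes."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names and matches any of them
    for tag in soup.find_all(invalid_tags):
        tag.unwrap()  # replace the tag with its own contents
    return soup

html = "<p>Love, <b>Hate</b>, and <i>Hap<b>piness</b><u>y</u></i></p>"
cleaned = strip_tags(html, ["b", "i", "u"])
print(cleaned)
```

Because unwrap() works on the parsed tree, no recursion or string concatenation is needed.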

【Comments】:

【Solution 2】:

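Your code runs as is.

The UserWarning: No parser was explicitly specified appears because your statement is soup = BeautifulSoup(html), with no parser argument.

The TypeError: a bytes-like object is required, not 'list' error is probably caused by a dependency issue.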
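
Whatever the underlying cause, the traceback in the question ends with BytesIO(source) receiving a list, so the object handed to the parser was not bytes or str at that point. One way to narrow this down is to check the type before parsing. This is a sketch only; ensure_markup is a hypothetical helper name, not part of bs4.

```python
def ensure_markup(markup):
    """Return markup unchanged if a parser can accept it, else raise a clear error.

    ensure_markup is a hypothetical helper for debugging, not a bs4 function.
    """
    if isinstance(markup, (str, bytes, bytearray)):
        return markup
    if hasattr(markup, "read"):  # open file handles are accepted as markup too
        return markup
    raise TypeError(
        "expected str, bytes, or a file-like object, got %s" % type(markup).__name__
    )

print(len(ensure_markup(b"<p>ok</p>")))
try:
    ensure_markup([b"<p>", b"ok", b"</p>"])  # a list, as in the reported error
except TypeError as exc:
    print(exc)
```

Calling this on the value passed to BeautifulSoup makes the failure appear at the call site instead of deep inside html5lib.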

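The bs4 documentation says that if you do not specify a parser, e.g. BeautifulSoup(markup), it uses the best HTML parser installed on your system:

If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.

On my system, BeautifulSoup(html, "html.parser") works fine, with decent speed and no warnings. html.parser ships with the Python standard library.

The documentation also summarizes the advantages and disadvantages of each parser library.

Try BeautifulSoup(html, "html.parser"). It should work.

If you want speed, you can try BeautifulSoup(html, "lxml"). If you do not have lxml's HTML parser, on Windows you may want to install it with pip install lxml.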
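
If the same script has to run on machines with and without lxml, the choice of parser can be made explicit with a fallback. This is a minimal sketch: make_soup is a hypothetical helper name, and it relies on bs4 raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred="lxml"):
    """Parse with the preferred parser, falling back to the built-in one."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; html.parser always is
        return BeautifulSoup(markup, "html.parser")

soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.get_text())
```

Naming the parser in both branches also avoids the UserWarning discussed above.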

【Comments】:

【Solution 3】:

For anyone looking for an answer that works on Python 3:

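from bs4 import BeautifulSoup, NavigableString

invalidTags = ['br','b','font']

def stripTags(html, invalid_tags):
    # html.parser is used because lxml wraps every fragment in <html><body>,
    # which would leak into the text built by the recursive call below.
    soup = BeautifulSoup(html, "html.parser")

    for tag in soup.find_all(True):  # True matches every tag
        if tag.name in invalid_tags:
            s = "::"  # marker prepended to the text of each stripped tag
            for c in tag.contents:
                # Strip nested tags recursively before concatenating
                if not isinstance(c, NavigableString):
                    c = stripTags(str(c), invalid_tags)
                s += str(c)
            tag.replace_with(s)
    return soup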

【Comments】:

Hmm. What am I missing? NameError: name 'NavigableString' is not defined
For some reason I had to add from bs4 import NavigableString. Now it works!