Removing all HTML tags using BeautifulSoup4 (Python 3.4)

【Question title】：Removing all HTML tags using BeautifulSoup4 (python 3.4)
【Posted】：2014-07-06 06:31:47
【Question】：

I have been trying to work this out, but the only way I have managed to do it is with a convoluted while loop.

I want to input the following:

"<td colspan='2' class='ToEx'>This is a test (<i> to see </i> this works) and I really hope it does</td>"

And get this output:

"This is a test (to see if this works) and I really hope it does"

Essentially, I want to remove every "<" and ">" along with everything between them. The best I could do with a few commands is:

"This is a test (<i> to see </i> this works) and I really hope it does"

But I am still left with these pesky things: <i></i>

Here is my code:

from bs4 import BeautifulSoup

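text = "<td colspan='2' class='ToEx'>This is a test (<i> to see </i> this works) and I really hope it does</td>"
soup = BeautifulSoup(text, "html.parser")  # name the parser explicitly
content = soup.find_all("td", "ToEx")  # every <td> whose class is "ToEx"
content[0].renderContents()  # returns the inner markup, <i> tags included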

【Question comments】：

Tags: python python-3.x web-scraping beautifulsoup

【Solution 1】：

Just print the tag's .text attribute and it will give you the text:

print(content[0].text)

Output:

This is a test ( to see  this works) and I really hope it does
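The output above still carries the spaces that sat inside the <i> tag. A minimal sketch of one way to tidy that up, assuming it is acceptable to collapse every run of whitespace into a single space (the re.sub step and the variable names are additions for illustration, not part of the original answer):

```python
import re

from bs4 import BeautifulSoup

html = "<td colspan='2' class='ToEx'>This is a test (<i> to see </i> this works) and I really hope it does</td>"
soup = BeautifulSoup(html, "html.parser")

# Find the first <td> with class "ToEx" and read its text content.
cell = soup.find("td", "ToEx")
raw = cell.text

# Collapse runs of whitespace left behind by the removed <i> tag.
collapsed = re.sub(r"\s+", " ", raw).strip()

print(raw)
print(collapsed)
```

Note that the space after the opening parenthesis remains, because it is a single space in the source text rather than a run of whitespace.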

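【Comments】：

That was the first thing I tried; something else must be going on, because I got an error. I will look into it some more. Thanks a lot.

【Solution 2】：

I would use get_text(), which is designed for exactly this situation: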

from bs4 import BeautifulSoup

text = "<td colspan='2' class='ToEx'>This is a test (<i> to see </i> this works) and I really hope it does</td>"
soup = BeautifulSoup(text, "html.parser")  # name the parser explicitly
print(soup.get_text())

This should work as per the documentation.

I have never seen .text used before; Beautiful Soup 4 also provides .strings, if you would rather use that:

from bs4 import BeautifulSoup

text = "<td colspan='2' class='ToEx'>This is a test (<i> to see </i> this works) and I really hope it does</td>"
soup = BeautifulSoup(text, "html.parser")

# .strings yields each text node in document order
for string in soup.strings:
    print(str(string), end="")

Both will output:

This is a test ( to see  this works) and I really hope it does

Both work fine, but get_text() is easier to use, especially if you want to save the text to a variable.
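As a rough illustration of saving the text to a variable, the sketch below wraps get_text() in a small helper. The name strip_tags is hypothetical, and the choice of "html.parser" is an assumption; get_text() also accepts a separator and a strip flag, which is used here to trim each text fragment before joining.

```python
from bs4 import BeautifulSoup


def strip_tags(html):
    """Hypothetical helper: return the text of an HTML fragment without tags."""
    return BeautifulSoup(html, "html.parser").get_text()


html = "<td colspan='2' class='ToEx'>This is a test (<i> to see </i> this works) and I really hope it does</td>"

# Store the plain text in a variable for later use.
plain = strip_tags(html)

# Joining .strings by hand gives the same result for this input.
joined = "".join(BeautifulSoup(html, "html.parser").strings)

# Strip each text fragment and join the pieces with a single space.
tidy = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

print(plain)
print(tidy)
```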

【Comments】：