How can I remove all different script tags in BeautifulSoup? Answers

【Question title】: How can I remove all different script tags in BeautifulSoup?
【Posted】: 2015-10-08 05:41:09
【Question】:

I am scraping a table from a web page and want to rebuild the table with all the script tags removed. Here is the source code.

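import requests
from bs4 import BeautifulSoup

response = requests.get(url)  # url: the page being scraped (not shown in the question)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find('table')

for row in table.find_all('tr'):
    for col in row.find_all('td'):
        # remove all different script tags (attempts so far)
        # col.replace_with('')
        # col.decompose()
        # col.extract()
        col = col.contents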

How can I remove all the different script tags? Take the following cell as an example, which contains the tags a, br, and td.

<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>

My expected result is:

Signal et Communication
Ingénierie Réseaux et Télécommunications

【Comments】:

Take a look at this: stackoverflow.com/questions/31462360/…

Tags: python html beautifulsoup html-parsing

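【Answer 1】:

What you are asking about is get_text():

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string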

td = soup.find("td")
td.get_text()

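Note that .string would return None in this case, because td has multiple children:

If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None

Demo: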

>>> from bs4 import BeautifulSoup
>>>
>>> soup = BeautifulSoup("""
... <td><a href="http://www.irit.fr/SC">Signal et Communication</a>
... <br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
... </td>
... """, "html.parser")
>>>
>>> td = soup.td
>>> print(td.string)
None
>>> print(td.get_text())
Signal et Communication
Ingénierie Réseaux et Télécommunications
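
Applied to the loop from the question, the same idea might look like the sketch below. The sample HTML and the "Team A" cell are made up for illustration, since the real page is not shown in the question. The separator and strip arguments to get_text() are used here to join the text fragments with a newline and to drop the whitespace-only strings between tags.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the real URL from the question is not known.
html = """
<table>
  <tr>
    <td>Team A</td>
    <td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

rows = []
for row in table.find_all("tr"):
    # strip=True trims each text fragment and skips the empty ones;
    # separator is placed between the fragments that remain.
    cells = [col.get_text(separator="\n", strip=True) for col in row.find_all("td")]
    rows.append(cells)

print(rows)
```

Each entry in rows is then a list of plain strings, one per cell, which can be written to CSV or used to rebuild the table without any markup.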

【Comments】:

@SparkandShine They are basically the same (text = property(get_text)). You should use the get_text() method.
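
A quick way to check this for yourself, as a small sketch with a shortened, made-up version of the cell: .text gives the same result as calling get_text() with no arguments, but only the method form accepts options such as separator and strip.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the cell from the question.
soup = BeautifulSoup("<td><a>Signal</a><br/><a>Réseaux</a></td>", "html.parser")
td = soup.td

# The property and the no-argument method call return the same string.
same = td.text == td.get_text()

# Only the method call can take arguments.
joined = td.get_text(separator=" | ", strip=True)

print(same, joined)
```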

【Answer 2】:

Try calling col.string. That will give you only the text.

【Comments】:

What is the difference between col.text and col.string?
Here are the definitions from the help documentation. text: Get all child strings, concatenated using the given separator. string: If this tag has a single string child, return value is that string. If this tag has no children, or more than one child, return value is None. If this tag has one child tag, return value is the 'string' attribute of the child tag, recursively.
As @alecxe said, it does not work here because of the multiple children, and it returns None.
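
To make the three cases in that definition concrete, here is a small sketch. The p elements and their id values are invented for the example and are not from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one element per case in the .string definition.
soup = BeautifulSoup(
    "<p id='one'>plain</p>"
    "<p id='nested'><b>bold</b></p>"
    "<p id='many'><a>first</a><br/><a>second</a></p>",
    "html.parser",
)

single = soup.find(id="one").string     # single string child: that string
nested = soup.find(id="nested").string  # one child tag: the child's .string
many = soup.find(id="many").string      # more than one child: None

# get_text() still works where .string gives None.
many_text = soup.find(id="many").get_text(separator="\n")

print(single, nested, many, many_text)
```

The last element has the same shape as the cell in the question, which is why col.string comes back as None there while get_text() returns both lines.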