How to discard certain tags and continue with the rest? Answers

【Question title】: How to discard certain tags and continue with the rest?
【Posted】: 2018-04-20 01:05:03
【Question description】:

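I wrote a script in Python to scrape some text from a few HTML elements. When I run it, it gives me all the text available in them. I don't want the text inside the p tags. A few days ago, while going through the BeautifulSoup documentation, I came across a method called .decompose(). I don't fully understand what it does, but I thought I could give it a try. However, when I run the script with it, I get an error.

Here is the script: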

html_elem ='''    
<div class="track">
    <p id="core">
        pop singer<span class="lnkcat"> intranet </span>
    </p>
    <p id="crude">
        songs<span class="lnkitm"> online </span>
    </p>
    <p id="evergreen">
        instrumental<span class="lnkapt"> hotline </span>
    </p>
    <a href="http://link" target="_blank">track one</a>
    <a href="http://link" target="_blank">track two</a>
    <a href="http://link" target="_blank">track three</a>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_elem, "lxml")
item = soup.find_all(class_="track")
# item.p.decompose()
for elem in item:
    print(elem.text.strip())

When I uncomment the line containing .decompose() and run the script, I get this error:

Traceback (most recent call last):
  File "C:\Users\ar\AppData\Local\Programs\Python\Python35-32\Social.py", line 28, in <module>
    item.p.decompose()
AttributeError: 'ResultSet' object has no attribute 'p'

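By the way, using just .find_all("a") I can get the data I want. Still, I would like to learn how to end up with only the text in the a tags, excluding the text in the p tags, even when I select the track class.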
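
For that last point, one option that leaves the parse tree untouched is to search for a tags inside each element matched by the track class, instead of deleting the p tags first. This is a minimal sketch, assuming the standard-library "html.parser" in place of "lxml" and a hypothetical variable name track_texts:

```python
from bs4 import BeautifulSoup

html_elem = """
<div class="track">
    <p id="core">pop singer<span class="lnkcat"> intranet </span></p>
    <p id="crude">songs<span class="lnkitm"> online </span></p>
    <a href="http://link" target="_blank">track one</a>
    <a href="http://link" target="_blank">track two</a>
    <a href="http://link" target="_blank">track three</a>
</div>
"""

# Assumption: "html.parser" is used only to avoid the lxml dependency.
soup = BeautifulSoup(html_elem, "html.parser")

# Start from the track class, then keep only the text of the <a> tags inside it.
track_texts = [
    a.get_text(strip=True)
    for item in soup.find_all(class_="track")
    for a in item.find_all("a")
]
print(track_texts)  # ['track one', 'track two', 'track three']
```

Nothing is removed here, so the p tags remain available if they are needed later.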

【Comments】:

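I think the error occurs because find_all() returns a list. item[0].p.decompose() should fix it.
You're almost there, @Swakeert Jain. Now it discards the first p tag. What about the remaining two p tags? Thanks a lot for the catch.
for p in item[0]("p"): p.decompose() should do it. stackoverflow.com/a/39904439/5561737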
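
The cause of the AttributeError discussed in these comments can be checked directly: find_all() returns a ResultSet (a list of tags), while attributes such as .p exist only on the individual tags. The sketch below is illustrative; the shortened HTML, the names items and first, and the "html.parser" choice are assumptions, not part of the original script.

```python
from bs4 import BeautifulSoup

html_elem = """
<div class="track">
    <p id="core">pop singer<span class="lnkcat"> intranet </span></p>
    <p id="crude">songs<span class="lnkitm"> online </span></p>
    <a href="http://link" target="_blank">track one</a>
</div>
"""

# Assumption: "html.parser" is used only to avoid the lxml dependency.
soup = BeautifulSoup(html_elem, "html.parser")

items = soup.find_all(class_="track")  # ResultSet: a list of Tag objects
first = items[0]                       # a single Tag

print(type(items).__name__)  # ResultSet
print(type(first).__name__)  # Tag
print(hasattr(items, "p"))   # False: the list itself has no .p attribute
print(first.p["id"])         # core: .p is only the FIRST <p> inside the tag
print(len(first("p")))       # 2: calling a tag is shorthand for find_all()
```

Because .p resolves to the first matching child only, item[0].p.decompose() removes a single p tag, which matches what the asker observed.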

Tags: python python-3.x web-scraping beautifulsoup

【Solution 1】:

Found a solution:

from bs4 import BeautifulSoup

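# html_elem is the HTML string defined in the question
soup = BeautifulSoup(html_elem, "lxml")
for item in soup.find_all(class_="track"):
    # Detach every <p> inside this item so only the <a> text remains
    for elem in item.find_all("p"):
        elem.extract()
    print(item.text.strip())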

It now gives me the following result:

track one
track two
track three
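
The question asks about .decompose() while this solution uses .extract(). Both remove a tag from the tree; the difference is that extract() returns the detached tag so it can still be used, whereas decompose() destroys it and returns None. A minimal sketch of that difference, assuming a shortened version of the HTML and the standard-library "html.parser" in place of "lxml" (the names div, removed and result are hypothetical):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div class="track"><p id="core">pop singer</p>'
    '<p id="crude">songs</p><a href="http://link">track one</a></div>',
    "html.parser",  # assumption: stdlib parser instead of "lxml"
)
div = soup.find(class_="track")

# extract() detaches the tag and hands it back to the caller
removed = div.find("p", id="core").extract()
print(removed.get_text())  # pop singer

# decompose() detaches and destroys the tag; there is nothing to keep
result = div.find("p", id="crude").decompose()
print(result)              # None

print(div.get_text())      # track one
```

If the removed tags are never needed again, either method produces the same remaining text.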

【Comments】:

【Solution 2】:

As @Swakeert Jain said, you can't call .decompose() on the whole result at once. However, you can remove the tags one by one with a for loop.

html_elem ='''    
<div class="track">
    <p id="core">
        pop singer<span class="lnkcat"> intranet </span>
    </p>
    <p id="crude">
        songs<span class="lnkitm"> online </span>
    </p>
    <p id="evergreen">
        instrumental<span class="lnkapt"> hotline </span>
    </p>
    <a href="http://link" target="_blank">track one</a>
    <a href="http://link" target="_blank">track two</a>
    <a href="http://link" target="_blank">track three</a>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_elem, "lxml")
item = soup.find_all(class_="track")
item[0].p.decompose()
# Removes only the first <p>: decompose by index, or loop over the <p> tags and decompose each

for elem in item:
    print(elem)

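【Comments】:

Thank you very much for your answer, Elvir Muslic. The problem is that it still discards only the first of the three p tags, not the other two.
As I said in the comment in the code above, you can use a for loop to discard every p in the code.
I tried to do that with a for loop but saw no improvement. What does the loop you recommend look like? You don't need to update your answer; just paste the loop here. By the way, this is what I tried: for link in item:link.p.decompose().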
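
The loop tried in the last comment iterates over the matched div elements, and there is only one, so link.p.decompose() runs once and removes only the first p. The loop has to run over the p tags themselves. A minimal sketch of one way to write it, assuming the standard-library "html.parser" in place of "lxml" and a hypothetical variable name texts:

```python
from bs4 import BeautifulSoup

html_elem = """
<div class="track">
    <p id="core">pop singer<span class="lnkcat"> intranet </span></p>
    <p id="crude">songs<span class="lnkitm"> online </span></p>
    <p id="evergreen">instrumental<span class="lnkapt"> hotline </span></p>
    <a href="http://link" target="_blank">track one</a>
    <a href="http://link" target="_blank">track two</a>
    <a href="http://link" target="_blank">track three</a>
</div>
"""

# Assumption: "html.parser" is used only to avoid the lxml dependency.
soup = BeautifulSoup(html_elem, "html.parser")

for item in soup.find_all(class_="track"):
    # find_all() returns a list built up front, so removing tags
    # while iterating over it is safe.
    for p in item.find_all("p"):
        p.decompose()
    # Collect the remaining text fragments, ignoring whitespace-only strings.
    texts = list(item.stripped_strings)
    print("\n".join(texts))
```

With the sample HTML this prints the three track names, one per line.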