【Question title】: Beautiful Soup: how to remove links *and* the link text from the soup
【Posted】: 2020-03-05 01:36:41
【Question】:

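I'm using Beautiful Soup to get cleaned-up text from a webpage: no HTML, just the text that is shown to the user. However, I don't want text that has a link attached to count as visible text. To make clear what I mean: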

This text is the problem

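The text above links to the Beautiful Soup documentation. At present I cut out the actual link, but the text 'This text is the problem' remains. Ideally I would like to remove that text as well.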

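【Comments】:

  • So you want to exclude all <a> tags from your soup?
  • Yes! Is there a built-in way to do that?
  • Yes. Find all <a> tags with href=True, then remove them. See the solution below.

Tags: beautifulsoup href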


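【Solution 1】:

You can find the <a> tags that have an href attribute and remove them from the tree with either .extract() or .decompose().

Here is the full example. First, the text without removing anything: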

from bs4 import BeautifulSoup

html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="nofollow noreferrer">This text is the problem</a></p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
    </div>'''

soup = BeautifulSoup(html, 'html.parser')

p_tags = soup.find_all('p')

for each in p_tags:
    print(each.text)

Output:

I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: 
This text is the problem
The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.

Then remove the links:

from bs4 import BeautifulSoup

html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="nofollow noreferrer">This text is the problem</a></p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
    </div>'''

soup = BeautifulSoup(html, 'html.parser')

# find_all is the current name; findAll is a legacy alias
for a in soup.find_all('a', href=True):
    a.extract()

p_tags = soup.find_all('p')

for each in p_tags:
    print(each.text)

Output:

I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: 

The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.
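If the goal is the cleaned text of a whole page rather than a per-paragraph printout, the same removal step can be wrapped in a small helper. This is only a sketch: the function name `text_without_links` and the sample HTML are hypothetical and not part of the original answer, and passing `separator=' '` and `strip=True` to `get_text()` is one possible way to tidy whitespace.

```python
from bs4 import BeautifulSoup


def text_without_links(html: str) -> str:
    """Return the visible text of `html` with every <a href=...> element removed.

    Hypothetical helper for illustration; not part of Beautiful Soup itself.
    """
    soup = BeautifulSoup(html, 'html.parser')
    # find_all returns a list, so removing tags while looping over it is safe
    for a in soup.find_all('a', href=True):
        a.decompose()
    # Strip each text fragment and join the non-empty ones with a single space
    return soup.get_text(separator=' ', strip=True)


sample = ('<p>Intro text.</p>'
          '<p><a href="https://example.com">Linked text</a></p>'
          '<p>Closing text.</p>')
print(text_without_links(sample))  # Intro text. Closing text.
```

Note that `href=True` only matches anchors that actually carry an href attribute; an `<a>` without one (for example a named anchor) is left in place.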

You can also use .decompose(), which removes a tag from the tree and destroys it:

from bs4 import BeautifulSoup

html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="nofollow noreferrer">This text is the problem</a></p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
    </div>'''

soup = BeautifulSoup(html, 'html.parser')

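# soup.a is only the first <a> tag, so loop over find_all to remove every link
for a in soup.find_all('a', href=True):
    a.decompose()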

p_tags = soup.find_all('p')

for each in p_tags:
    print(each.text)
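The two methods differ in what happens to the removed tag: .extract() takes the tag out of the tree and returns it, so it can still be inspected, while .decompose() removes the tag and destroys it. The following is a minimal sketch of that difference; the HTML snippet and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<p>Before <a href="https://example.com">link text</a>after</p>'

# extract(): the tag leaves the tree but is returned and remains usable
soup = BeautifulSoup(html, 'html.parser')
removed = soup.a.extract()
print(removed.get_text())  # link text
print(removed['href'])     # https://example.com
print(soup.p.get_text())   # Before after

# decompose(): the tag is removed and destroyed; nothing useful is returned
soup2 = BeautifulSoup(html, 'html.parser')
soup2.a.decompose()
print(soup2.a)             # None
print(soup2.p.get_text())  # Before after
```

If the removed links are never needed again, either method works for cleaning the text; .extract() is the one to pick when the links should be collected, for example to log their URLs.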

【Discussion】:
