【Question title】: Removing elements from HTML using BeautifulSoup and Python 3
【Posted】: 2017-12-23 15:52:51
【Question】:

I am scraping data from the web and trying to remove every element that has the tag "div" and the class "notes module", like the HTML below:

        <div class="notes module" role="complementary">
  <h3 class="heading">Notes:</h3>
    <ul class="associations">
        <li>
          Translation into Русский available: 
            <a href="/works/494195">Два-два-один Браво Бейкер</a> by <a rel="author" href="/users/dzenka/pseuds/dzenka">dzenka</a>, <a rel="author" href="/users/La_Ardilla/pseuds/La_Ardilla">La_Ardilla</a>
        </li>
    </ul>
    <blockquote class="userstuff">
      <p>
  <i>Warnings: numerous references to and glancing depictions of combat, injury, murder, and mutilation of the dead; deaths of minor and major original characters. Numerous explicit depictions of sex between two men.</i>
</p>
    </blockquote>
    <p class="jump">(See the end of the work for <a href="#children">other works inspired by this one</a>.)</p>
</div>

The source is here: view-source:http://archiveofourown.org/works/180121?view_full_work=true

I am having trouble even finding and printing the elements I want to remove. So far I have:

import urllib.request, urllib.parse, urllib.error
from lxml import html
from bs4 import BeautifulSoup

url = 'http://archiveofourown.org/works/180121?view_full_work=true'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
removals = soup.find_all('div', {'id':'notes module'})
for match in removals:
    match.decompose()

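But removals comes back as an empty list. Can you help me select the whole div element shown above, so that I can find and remove all such elements from the HTML?

Thanks.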

【Comments】:

    Tags: html python-3.x web-scraping beautifulsoup


    【Solution 1】:

    The divs you are looking for have class = "notes module", but your code searches for divs with id = "notes module". Change this line:

    removals = soup.find_all('div', {'id':'notes module'})
    

    to this:

    removals = soup.find_all('div', {'class':'notes module'})
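
A minimal sketch of the same removal using a CSS selector, which requires both classes but does not depend on the order in which they appear in the attribute. The `sample_html` string is a hypothetical stand-in for the fetched page, and `html.parser` is used here only so the sketch runs without lxml; whether this resolves the empty list on the live page is not confirmed by the thread.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the fetched page.
sample_html = """
<div id="workskin">
  <div class="notes module" role="complementary"><h3 class="heading">Notes:</h3></div>
  <div class="module notes"><p>Same classes, different order</p></div>
  <div class="userstuff"><p>Story text</p></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# The selector matches divs that carry both classes, in any order.
removals = soup.select("div.notes.module")
removed_count = len(removals)
for match in removals:
    match.decompose()  # remove the tag and everything inside it from the tree

# Collect the class attribute of every div that is still in the tree.
remaining_classes = [div.get("class") for div in soup.find_all("div")]
print(removed_count)
print(remaining_classes)
```

If the list is still empty against the real page, printing part of `soup` is a quick way to check that the response actually contains the expected markup.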
    

    【Comments】:

    • Thanks for pointing that out. However, I still get an empty list.
    【Solution 2】:

    Give this a try. It removes every div found under class='wrapper' on that page:

    import requests
    from bs4 import BeautifulSoup

    response = requests.get('http://archiveofourown.org/works/180121?view_full_work=true')
    soup = BeautifulSoup(response.text, 'lxml')
    for item in soup.select(".wrapper"):
        # item("div") is shorthand for item.find_all("div")
        for elem in item("div"):
            elem.extract()  # detach each div from the tree
        print(item)
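
The snippet above strips every div under `.wrapper`, which is broader than what the question asks for. As a hedged sketch, the same `extract()` call can be limited to the one div of interest. `extract()` returns the detached tag, so its content is still available afterwards, whereas `decompose()` destroys it. The `sample_html` markup is hypothetical and `html.parser` is an assumption made so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the fetched page.
sample_html = """
<div class="wrapper">
  <div class="notes module"><p>Note text</p></div>
  <p class="keep">Story text</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
wrapper = soup.select_one(".wrapper")

# class_="notes" matches any div whose class list contains "notes".
note = wrapper.find("div", class_="notes")

# extract() removes the tag from the tree and returns it.
detached = note.extract()
detached_text = detached.get_text(strip=True)

# Only the paragraph remains inside the wrapper.
remaining_text = wrapper.get_text(strip=True)
print(detached_text)
print(remaining_text)
```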
    

    【Comments】:
