【Question title】: Scraping URLs in Beautiful Soup but not getting all links
【Posted】: 2021-10-07 17:22:29
【Question description】:

This is the relevant part of the HTML being scraped:

<div class="blockSpoiler-content">
   <div class="contentSpoiler">
      <div class="link-box" id="62H" style="background-color: rgb(65, 120, 50);">
         <div class="status-box"><i class="working" title="Working"></i></div>
         <a rel="external" href="https://url1.net.html" target="_blank">Link1</a>
      </div>
      <div class="link-box" id="IFA" style="background-color: rgb(65, 120, 50);">
         <div class="status-box"><i class="working" title="Working"></i></div>
         <a rel="external" href="https://url2.net.html" target="_blank">Link2</a>
      </div>
      <div class="link-box" id="ruG" style="background-color: rgb(65, 120, 50);">
         <div class="status-box"><i class="working" title="Working"></i></div>
         <a rel="external" href="https://url3.com.html" target="_blank">Link3</a>
      </div>
      <div class="link-box" id="Bdf" style="background-color: rgb(65, 120, 50);">
         <div class="status-box"><i class="working" title="Working"></i></div>
         <a rel="external" href="https://url4.com" target="_blank">Link4</a>
      </div>
      <div class="link-box" id="1Da" style="background-color: rgb(65, 120, 50);">
         <div class="status-box"><i class="working" title="Working"></i></div>
         <a rel="external" href="https://url5.net.html" target="_blank">Link5</a>
      </div>
   </div>
</div>

I am trying to get these URLs:

  1. https://url1.net.html
  2. https://url2.net.html
  3. https://url3.com.html
  4. https://url4.com
  5. https://url5.net.html

I have tried several approaches but only got this far (the local file is just for testing, before scraping the live page):

from bs4 import BeautifulSoup

with open("mainLocalFile.html") as fp:
    soup2 = BeautifulSoup(fp, 'html.parser')

links = soup2.find_all('div', class_='blockSpoiler-content')
# print(links)
for link in links:
    print(link)
    print(link.a)          # prints only the first tag
    print(link.a['href'])  # prints only the first URL

【Question discussion】:

    Tags: python web-scraping beautifulsoup


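    【Solution 1】:

    Select all <a> tags under the element with class blockSpoiler-content (right now your .find_all call selects only the single <div class="blockSpoiler-content">, and .a on that div returns just its first <a>):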

    for a in soup.select(".blockSpoiler-content a"):
        print(a["href"])
    

    Prints:

    https://url1.net.html
    https://url2.net.html
    https://url3.com.html
    https://url4.com
    https://url5.net.html
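
For reference, here is a minimal self-contained sketch of the same idea. It assumes a trimmed inline copy of the question's HTML (only two of the five link boxes) instead of the local file, and the variable names `html`, `hrefs_select` and `hrefs_find_all` are made up for illustration. It also shows an equivalent nested `find_all` form: `link.a` is shorthand for `link.find("a")`, which returns only the first match, so collecting every link needs `find_all("a")` or a CSS selector.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question (two of the five link boxes).
html = """
<div class="blockSpoiler-content">
   <div class="contentSpoiler">
      <div class="link-box" id="62H">
         <a rel="external" href="https://url1.net.html" target="_blank">Link1</a>
      </div>
      <div class="link-box" id="Bdf">
         <a rel="external" href="https://url4.com" target="_blank">Link4</a>
      </div>
   </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector: every <a> that is a descendant of .blockSpoiler-content.
hrefs_select = [a["href"] for a in soup.select(".blockSpoiler-content a")]

# Same result with find_all: look for <a> tags inside each matching <div>.
# href=True skips any <a> without an href attribute.
hrefs_find_all = [
    a["href"]
    for div in soup.find_all("div", class_="blockSpoiler-content")
    for a in div.find_all("a", href=True)
]

for href in hrefs_select:
    print(href)
```

Both lists hold the same URLs in document order. The selector form is shorter, while the nested `find_all` form may be easier to extend if each container needs separate handling.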
    

    【Discussion】:
