【Question title】: How to extract links from a page using Beautiful Soup
【Posted】: 2019-06-03 04:26:56
【Question description】:

I have an HTML page that contains multiple divs, for example:

<div class="post-info-wrap">
  <h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-1st-post/" title="Example of 1st post &#8211; Example 1 Post" rel="bookmark">sample post &#8211; example 1 post</a></h2>
  <div class="post-meta clearfix">

    <div class="post-info-wrap">
      <h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-2nd-post/" title="Example of 2nd post &#8211; Example 2 Post" rel="bookmark">sample post &#8211; example 2 post</a></h2>
      <div class="post-meta clearfix">

I need to get the values from all divs with the class post-info-wrap. I am new to BeautifulSoup.

So I need these URLs:

I tried:

import re
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.example.com/blog/author/abc") 
data = r.content  # Content of response

soup = BeautifulSoup(data, "html.parser")
for link in soup.select('.post-info-wrap'):
   print(link.find('a').attrs['href'])

This code doesn't seem to work. I'm not familiar with Beautiful Soup. How can I extract the links?

【Comments】:

    Tags: python beautifulsoup


    【Solution 1】:

    You can use soup.find_all:

    from bs4 import BeautifulSoup as soup
    # `html` holds the page markup as a string, e.g. the snippet from the question
    r = [i.a['href'] for i in soup(html, 'html.parser').find_all('div', {'class':'post-info-wrap'})]
    

    Output:

    ['https://www.example.com/blog/111/this-is-1st-post/', 'https://www.example.com/blog/111/this-is-2nd-post/']
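
For reference, here is a self-contained sketch of the same find_all approach. The `html` string is an assumption modeled on the question's markup (closing tags added for illustration), and the `if` guard is an addition that skips wrappers with no link.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question; the closing tags are added here for illustration.
html = """
<div class="post-info-wrap">
  <h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-1st-post/" rel="bookmark">sample post 1</a></h2>
</div>
<div class="post-info-wrap">
  <h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-2nd-post/" rel="bookmark">sample post 2</a></h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# One href per wrapper div; wrappers without an <a href=...> are skipped.
links = [
    div.a["href"]
    for div in soup.find_all("div", class_="post-info-wrap")
    if div.a is not None and div.a.has_attr("href")
]
print(links)
```

Without the guard, `div.a["href"]` would raise a TypeError for any wrapper that has no anchor, because `div.a` is None in that case.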
    

    【Comments】:

    • The code doesn't work. I guess it's because you forgot that there is a … before href

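    【Solution 2】:

    link = i.find('a', href=True) does not always return an anchor tag (a); it may return None. So you need to check whether link is None: if it is, continue the for loop; otherwise, read the link's href value.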
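
A minimal sketch of when find returns None. The markup below is hypothetical (it is not from the question): the second wrapper has no link at all, and the third has an anchor without an href.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second wrapper has no link, the third has an <a> without href.
html = """
<div class="post-info-wrap"><h2><a href="https://www.example.com/blog/111/this-is-1st-post/">first</a></h2></div>
<div class="post-info-wrap"><h2>no link here</h2></div>
<div class="post-info-wrap"><h2><a name="anchor-only">no href</a></h2></div>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
for div in soup.find_all("div", {"class": "post-info-wrap"}):
    link = div.find("a", href=True)  # None when no <a> carrying an href exists
    results.append(None if link is None else link["href"])

print(results)
```

Only the first wrapper yields a URL; the other two produce None, which is why the loop needs the check before indexing with ['href'].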

    Scraping links from a URL:

    import requests
    from bs4 import BeautifulSoup

    r = requests.get("https://www.example.com/blog/author/abc")
    data = r.content  # Content of response
    soup = BeautifulSoup(data, "html.parser")

    for i in soup.find_all('div', {'class': 'post-info-wrap'}):
        link = i.find('a', href=True)  # None if the div has no <a> with an href
        if link is None:
            continue
        print(link['href'])
    

    Scraping links from an HTML string:

    from bs4 import BeautifulSoup
    html = '''<div class="post-info-wrap"><h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-1st-post/" title="Example of 1st post &#8211; Example 1 Post" rel="bookmark">sample post &#8211; example 1 post</a></h2><div class="post-meta clearfix">
    <div class="post-info-wrap"><h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-2nd-post/" title="Example of 2nd post &#8211; Example 2 Post" rel="bookmark">sample post &#8211; example 2 post</a></h2><div class="post-meta clearfix">'''
    
    soup = BeautifulSoup(html, "html.parser")
    
    for i in soup.find_all('div',{'class':'post-info-wrap'}):
       link = i.find('a',href=True)
       if link is None:
           continue
       print(link['href'])
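
As an alternative sketch, the select method used in the question can target the anchors directly with a CSS selector, which makes the None check unnecessary. The markup is a shortened copy of the question's snippet (unclosed div tags included); the exact selector is an assumption about how strict the match should be.

```python
from bs4 import BeautifulSoup

# Markup shaped like the question's snippet, including its unclosed <div> tags.
html = '''<div class="post-info-wrap">
  <h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-1st-post/" rel="bookmark">sample post 1</a></h2>
  <div class="post-meta clearfix">
    <div class="post-info-wrap">
      <h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-2nd-post/" rel="bookmark">sample post 2</a></h2>
      <div class="post-meta clearfix">'''

soup = BeautifulSoup(html, "html.parser")

# Match only <a> tags that carry an href and sit directly inside a post title.
hrefs = [a["href"] for a in soup.select("div.post-info-wrap h2.post-title > a[href]")]
print(hrefs)
```

Because the selector only matches anchors that have an href, every element in the result can be indexed safely.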
    

    Update:

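    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    # Selenium 4 takes the driver path through a Service object
    driver = webdriver.Chrome(service=Service('/usr/bin/chromedriver'))
    driver.get("https://www.example.com/blog/author/abc")

    # page_source reflects the page as loaded in the browser, including content added by JavaScript
    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()  # close the browser once the source has been captured

    for i in soup.find_all('div', {'class': 'post-info-wrap'}):
        link = i.find('a', href=True)
        if link is None:
            continue
        print(link['href'])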
    

    Output:

    https://www.example.com/blog/911/article-1/
    https://www.example.com/blog/911/article-2/
    https://www.example.com/blog/911/article-3/
    https://www.example.com/blog/911/article-4/
    https://www.example.com/blog/random-blog/article-5/
    

    ChromeDriver downloads for the Chrome browser:

    http://chromedriver.chromium.org/downloads

    Installing the web driver for the Chrome browser:

    https://christopher.su/2015/selenium-chromedriver-ubuntu/

    Selenium tutorial:

    https://selenium-python.readthedocs.io/

    Here '/usr/bin/chromedriver' is the path to the Chrome webdriver.

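    【Comments】:

    • @seeker2345 The divs on the page https://www.example.com/blog/author/abc do not contain the .post-info-wrap class.
    • @seeker2345 You should use a valid website URL whose divs contain the .post-info-wrap class.
    • Look at the page output in the example. I guess the code doesn't work because you forgot that there is a … before href

    • @seeker2345 You should try the second solution; it gives output like this: example.com/blog/111/this-is-1st-post example.com/blog/111/this-is-2nd-post
    • @seeker2345 Your "getastra.com/blog/author/vikas" URL is rendered dynamically (through JS or AJAX requests), and you are using the plain requests module to download the page. The requests module only downloads the static page data. You should use the selenium automation library to scrape dynamic data.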
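
One way to check whether a page falls into this case is to count the target divs in the raw HTML that requests receives. The sketch below is only illustrative: `count_wrappers` is a hypothetical helper, and the two sample strings stand in for a server-rendered response and a JavaScript shell.

```python
from bs4 import BeautifulSoup


def count_wrappers(html, css_class="post-info-wrap"):
    """Return how many <div> elements with the given class exist in the raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("div", class_=css_class))


# Hypothetical responses: one server-rendered, one an empty shell filled in later by JavaScript.
static_page = '<div class="post-info-wrap"><h2><a href="/blog/1/">post</a></h2></div>'
js_shell = '<div id="app"></div><script src="/bundle.js"></script>'

print(count_wrappers(static_page))
print(count_wrappers(js_shell))
```

In practice you would pass `requests.get(url).text` to the helper. If it returns 0 while the browser clearly shows the posts, the content is most likely added by JavaScript, and a browser-driven tool such as selenium is the better fit.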