【Question title】: Why am I getting repetitive output while trying to scrape data from Google Scholar?
【Posted】: 2013-11-01 07:11:29
【Question】:

I am trying to scrape the PDF links from Google Scholar search results. I set a page counter based on how the URL changes, but after the first eight links the output is just repeated links.

#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup
import urllib2
import requests


#modifying the url as per page
urlCounter = 0
while urlCounter <=30:
    urlPart1 = "http://scholar.google.com/scholar?start="
    urlPart2 = "&q=%22entity+resolution%22&hl=en&as_sdt=0,4"
    url = urlPart1 + str(urlCounter) + urlPart2
    page = urllib2.Request(url,None,{"User-Agent":"Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"})
    resp = urllib2.urlopen(page)
    html = resp.read()
    soup = BeautifulSoup(html)
    urlCounter = urlCounter + 10

    recordCount = 0
    while recordCount <=9:
        recordPart1 = "gs_ggsW"
        finRecord = recordPart1 + str(recordCount)
        recordCount = recordCount+1

    #printing the links
        for link in soup.find_all('div', id = finRecord):
            linkstring = str(link)
            soup1 = BeautifulSoup(linkstring)
        for link in soup1.find_all('a'):
            print(link.get('href'))

【Comments】:

    Tags: python web-scraping urllib2 google-scholar


    【Solution 1】:

    Change the following line in your code from:

    finRecord = recordPart1 + str(recordCount)

    to:

    finRecord = recordPart1 + str(recordCount+urlCounter-10)
    

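    The real problem: the div ids on the first page are gs_ggsW[0-9], but on the second page they are gs_ggsW[10-19]. So Beautiful Soup finds no matching divs on the second page. The -10 in the fix is needed because urlCounter has already been incremented by the time the ids are built.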
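The offset arithmetic can be checked in isolation. This is a minimal sketch, assuming ten results per page and the gs_ggsW prefix from the question; `record_ids` is a hypothetical helper, not part of the original code.

```python
def record_ids(start, per_page=10, prefix="gs_ggsW"):
    """Return the div ids expected on the page whose first result is `start`."""
    return [prefix + str(start + i) for i in range(per_page)]


print(record_ids(0)[:3])   # first ids on the first page
print(record_ids(10)[:3])  # first ids on the second page
```

If the ids really follow this pattern, looking up gs_ggsW0 through gs_ggsW9 on any page after the first cannot match anything.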

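    Python's variable scoping can confuse people coming from other languages such as Java. After the for loop below has run, the variable link still exists, and so does soup1, which is only reassigned when a matching div is found. On later pages nothing matches, so soup1 still holds the last record of the first page and the same link is printed again.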

    for link in soup1.find_all('a'):
        print(link.get('href'))
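The stale-variable effect can be reproduced without any scraping. This is a minimal sketch with made-up page data; the `collect` helper and the file names are illustrative only.

```python
def collect(pages):
    """Mimic the question's loop: `last` is only reassigned when a page has matches."""
    seen = []
    last = None
    for matches in pages:
        for item in matches:  # the body never runs when `matches` is empty
            last = item
        seen.append(last)     # so `last` may still hold an earlier page's value
    return seen


# Page 1 has matches; page 2 has none because its ids differ.
print(collect([["a.pdf", "b.pdf"], []]))
```

Resetting the variable at the top of each page, or printing only inside the loop that finds the match, avoids the repeated value.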
    

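    Update:

    Google may not provide a PDF download link for every paper, so you cannot rely on the id to match each paper's link. Instead, you can use a CSS selector to match all the links at once.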

    soup = BeautifulSoup(html)
    urlCounter = urlCounter + 10
    for link in soup.select('div.gs_ttss a'):
        print(link.get('href'))
    

    【Comments】:

      【Solution 2】:

      Check out the SelectorGadget Chrome extension, which gives you a CSS selector when you click on the desired element in your browser.

      Code and example in the online IDE to extract the PDF links:

      from bs4 import BeautifulSoup
      import requests, lxml
      
      params = {
          "q": "entity resolution", # search query
          "hl": "en"                # language
      }
      
      # https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers
      headers = {
          "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582",
      }
      
      html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
      soup = BeautifulSoup(html.text, "lxml")
      
      for pdf_link in soup.select(".gs_or_ggsm a"):
          pdf_file_link = pdf_link["href"]
          print(pdf_file_link)
      
      
      # output from the first page:
      '''
      https://linqs.github.io/linqs-website/assets/resources/getoor-vldb12-slides.pdf
      http://ilpubs.stanford.edu:8090/859/1/2008-7.pdf
      https://drum.lib.umd.edu/bitstream/handle/1903/4241/umi-umd-4070.pdf;sequence=1
      https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.169.9535&rep=rep1&type=pdf
      https://arxiv.org/pdf/1208.1927
      https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.6875&rep=rep1&type=pdf
      http://da.qcri.org/ntang/pubs/vldb18-deeper.pdf
      '''
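The snippet above fetches only the first page. To page through results the way the question does, the `start` offset from the question's URL can be added to the query parameters. This is a minimal sketch using only the standard library to show the resulting URLs; `page_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/scholar"


def page_url(query, start, lang="en"):
    """Build the search URL for the page whose first result has index `start`."""
    return BASE_URL + "?" + urlencode({"q": query, "hl": lang, "start": start})


for start in range(0, 40, 10):  # 0, 10, 20, 30, as in the question's loop
    print(page_url("entity resolution", start))
```

With requests, adding a `"start"` key to the `params` dict should produce an equivalent URL, since requests encodes `params` into the query string for you.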
      

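      Alternatively, you can use the Google Scholar Organic Results API from SerpApi to achieve the same thing. It is a paid API with a free plan.

      The main difference is that you only need to pull the data out of structured JSON, rather than working out how to extract it from HTML and how to get around blocks from the search engine.

      Code to integrate: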

      from serpapi import GoogleSearch
      
      params = {
          "api_key": "YOUR_API_KEY",   # SerpApi API key
          "engine": "google_scholar",  # Google Scholar organic results
          "q": "entity resolution",    # search query
          "hl": "en"                   # language
      }
      
      search = GoogleSearch(params)
      results = search.get_dict()
      
      for pdfs in results["organic_results"]:
          for link in pdfs.get("resources", []):
              pdf_link = link["link"]
              print(pdf_link)
      
      
      # output:
      '''
      https://linqs.github.io/linqs-website/assets/resources/getoor-vldb12-slides.pdf
      http://ilpubs.stanford.edu:8090/859/1/2008-7.pdf
      https://drum.lib.umd.edu/bitstream/handle/1903/4241/umi-umd-4070.pdf;sequence=1
      https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.169.9535&rep=rep1&type=pdf
      https://arxiv.org/pdf/1208.1927
      https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.6875&rep=rep1&type=pdf
      http://da.qcri.org/ntang/pubs/vldb18-deeper.pdf
      '''
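The `.get("resources", [])` default in the loop above suggests that some results carry no `resources` key. This is a minimal sketch of the same traversal on made-up data, so it runs without an API key; `pdf_links` and the sample dict are hypothetical, and only the key names are taken from the code above.

```python
def pdf_links(results):
    """Flatten every resource link out of a result dict shaped like the one above."""
    return [
        resource["link"]
        for result in results.get("organic_results", [])
        for resource in result.get("resources", [])
    ]


# Made-up data using the same keys; the second entry has no "resources".
sample = {
    "organic_results": [
        {"title": "Paper A", "resources": [{"link": "http://example.com/a.pdf"}]},
        {"title": "Paper B"},
    ]
}
print(pdf_links(sample))
```

Entries without resources are skipped silently instead of raising a KeyError.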
      

      If you want to extract more data from the organic results, I have a dedicated blog post, Scrape Google Scholar with Python.

      Disclaimer: I work for SerpApi.

      【Comments】:
