Parsing Google Scholar results with Python and BeautifulSoup

【Question title】: Parsing Google Scholar results with Python and BeautifulSoup
【Posted】: 2018-05-27 19:42:38
【Question】:

Given a typical keyword search in Google Scholar (see screenshot), I would like to get a dictionary containing the title and url of each publication that appears on the page, e.g. results = {'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells', 'url': 'https://www.nature.com/articles/338427a0'}.

To retrieve the results page from Google Scholar, I use the following code:

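# Python 2 code (note the print statement below). In Python 3, quote_plus is in
# urllib.parse and FancyURLopener is deprecated.
from urllib import FancyURLopener, quote_plus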
from bs4 import BeautifulSoup

class AppURLOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'

openurl = AppURLOpener().open
query = "Vicia faba"
url = 'https://scholar.google.com/scholar?q=' + quote_plus(query) + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
#print url
content = openurl(url).read()
page = BeautifulSoup(content, 'lxml')
print page

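This code correctly returns the results page as (very ugly) HTML. However, I can't get any further, because I don't know how to use BeautifulSoup (which I'm not very familiar with) to parse the results page and retrieve the data.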

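Note that the question is about parsing the results page and extracting data from it, not about Google Scholar itself, since the code above retrieves the results page correctly.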

Can anyone give me a hint? Thanks in advance!

【Question comments】:

Tags: python beautifulsoup google-scholar

【Answer 1】:

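Inspecting the page content shows that each search result is wrapped in an h3 tag with the attribute class="gs_rt". You can use BeautifulSoup to pull out just those tags, then read the title and URL from the <a> tag inside each entry. Write each title/URL pair to a dict and collect the dicts in a list: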

import requests
from bs4 import BeautifulSoup

# The query is already URL-encoded (%20 is a space)
query = "Vicia%20faba"
url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'

# Fetch the results page and parse it
content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')
results = []
# Each result heading is an <h3 class="gs_rt"> whose <a> holds the title and link
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    results.append({"title": entry.a.text, "url": entry.a['href']})

Output:

[{'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells',
  'url': 'https://www.nature.com/articles/338427a0'},
 {'title': 'Hydrogen peroxide is involved in abscisic acid-induced stomatal closure in Vicia faba',
  'url': 'http://www.plantphysiol.org/content/126/4/1438.short'},
 ...]

Note: I used requests instead of urllib because my urllib would not load FancyURLopener. The BeautifulSoup syntax should be the same regardless of how you fetch the page content.
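The same extraction can be tried offline on a small piece of HTML, which makes it easier to see what the selector does. The sketch below is only an illustration: SAMPLE_HTML is a hand-written, hypothetical snippet that imitates the h3/gs_rt structure described above, not real Google Scholar markup, and extract_results is a name made up for this example. It also assumes that some headings may contain no <a> tag (for instance citation-only entries), in which case entry.a is None and entry.a.text would fail, so such entries are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written sample imitating the structure described above.
# Real Google Scholar markup carries more attributes and may change.
SAMPLE_HTML = """
<div class="gs_ri">
  <h3 class="gs_rt"><a href="https://www.nature.com/articles/338427a0">Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells</a></h3>
</div>
<div class="gs_ri">
  <h3 class="gs_rt"><span>[CITATION]</span> An entry without a link</h3>
</div>
"""


def extract_results(html):
    """Return a list of {'title', 'url'} dicts, one per linked h3.gs_rt entry."""
    # "html.parser" is the parser bundled with Python, so lxml is not required.
    page = BeautifulSoup(html, "html.parser")
    results = []
    for entry in page.find_all("h3", class_="gs_rt"):
        link = entry.a
        if link is None or not link.get("href"):
            # No usable <a> in this heading; entry.a.text would raise here.
            continue
        results.append({"title": link.get_text(strip=True), "url": link["href"]})
    return results


if __name__ == "__main__":
    for item in extract_results(SAMPLE_HTML):
        print(item)
```

Passing the text of a real response to extract_results instead of SAMPLE_HTML should behave the same way, as long as the page still uses the gs_rt class.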

【Comments】:

Dear andrew_reece, thank you very much! It works perfectly. Indeed, using requests seems simpler and more efficient than using urllib.

【Answer 2】:

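At the time of writing this answer, andrew_reece's answer does not work. Even though the h3 tag with the correct class is present in the page source, the script can still fail, for example because it receives a CAPTCHA page: Google detects your script as automated. Print the response to see the message.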

After sending too many requests I got this:

The block will expire shortly after those requests stop.
Sometimes you may be asked to solve the CAPTCHA
if you are using advanced terms that robots are known to use, 
or sending requests very quickly.
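One way to notice this situation in a script, rather than by reading the printed response, is to check the HTML before parsing it. The following is a rough heuristic sketch, not a reliable detector: the marker phrases are copied from the message quoted above and Google may reword them at any time, and looks_blocked is a name invented for this example.

```python
# Marker phrases copied from the block message quoted above. Treat them as
# assumptions: the wording of that page can change.
BLOCK_MARKERS = (
    "solve the CAPTCHA",
    "The block will expire shortly after those requests stop",
)


def looks_blocked(html):
    """Heuristic: return True if the page looks like a block/CAPTCHA page."""
    lowered = html.lower()
    if any(marker.lower() in lowered for marker in BLOCK_MARKERS):
        return True
    # A normal results page contains the gs_rt class used by the first answer.
    # Caveat: a search with zero hits also lacks it and is reported as blocked.
    return "gs_rt" not in html


if __name__ == "__main__":
    print(looks_blocked("Sometimes you may be asked to solve the CAPTCHA"))
```

Calling looks_blocked on the response text before handing it to BeautifulSoup lets the script stop early instead of silently returning an empty result list.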

The first thing you can do is add a proxy to your requests:

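# https://docs.python-requests.org/en/master/user/advanced/#proxies
import os

proxies = {
  # requests picks the proxy by URL scheme, so https:// pages need an 'https' entry
  'http': os.getenv('HTTP_PROXY'),   # or type your proxy URL here instead of os.getenv()
  'https': os.getenv('HTTPS_PROXY'),
}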

The request then looks like this:

html = requests.get('google scholar link', headers=headers, proxies=proxies).text
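The line above relies on a headers dict and a proxies dict that are not shown in full. As a minimal sketch of how the two could be assembled, the helper below builds both from environment variables. The User-Agent string, the function name build_request_kwargs and the use of HTTPS_PROXY are assumptions made for this example, not part of the original answer. Google Scholar is served over https, and requests chooses the proxy by URL scheme, so the sketch fills an 'https' entry as well.

```python
import os

# Hypothetical User-Agent value; the answer does not show its `headers` dict.
DEFAULT_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request_kwargs(environ=os.environ, user_agent=DEFAULT_USER_AGENT):
    """Return the `headers` and `proxies` keyword arguments for requests.get()."""
    headers = {"User-Agent": user_agent}
    proxies = {}
    # Only add a proxy entry when the matching environment variable is set.
    for scheme, variable in (("http", "HTTP_PROXY"), ("https", "HTTPS_PROXY")):
        value = environ.get(variable)
        if value:
            proxies[scheme] = value
    return {"headers": headers, "proxies": proxies}


if __name__ == "__main__":
    # Intended use: requests.get(url, **build_request_kwargs()).text
    print(build_request_kwargs())
```

Passing the mapping in as a parameter keeps the helper easy to test with a plain dict instead of the real environment.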

Alternatively, you can use requests-HTML, selenium, or pyppeteer to make it work without a proxy, simply by rendering the page.

Code:

# If you get an empty list, you most likely received a CAPTCHA page.

from requests_html import HTMLSession
import json

session = HTMLSession()
response = session.get('https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=vicia+faba&btnG=')

# https://requests-html.kennethreitz.org/#javascript-support
response.html.render()

results = []

# Container where data we need is located
for result in response.html.find('.gs_ri'):
    title = result.find('.gs_rt', first = True).text
    # print(title)
    
    # absolute_links is a set of all links in the result block; next(iter(...))
    # takes one arbitrary element, which is not guaranteed to be the title link
    url = next(iter(result.absolute_links))
    # print(url)

    results.append({
        'title': title,
        'url': url,
    })

print(json.dumps(results, indent = 2, ensure_ascii = False))

Partial output:

[
  {
    "title": "Faba bean (Vicia faba L.)",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429097000257"
  },
  {
    "title": "Nutritional value of faba bean (Vicia faba L.) seeds for feed and food",
    "url": "https://scholar.google.com/scholar?cluster=956029896799880103&hl=en&as_sdt=0,5"
  }
]
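In the partial output, the second URL points back to scholar.google.com rather than to the article. A likely cause is that absolute_links is a set containing every link in the result block, so next(iter(...)) returns whichever one comes first. One possible workaround, sketched here under the assumption that the article link is the only link in the block leaving scholar.google.com, is to filter the set by host. The function name pick_external_link is invented for this example.

```python
from urllib.parse import urlsplit


def pick_external_link(links, own_host="scholar.google.com"):
    """Return a link that leaves `own_host`, or None if every link stays on it."""
    # Sorting makes the choice deterministic if several external links exist.
    external = sorted(
        link for link in links if urlsplit(link).hostname != own_host
    )
    return external[0] if external else None


if __name__ == "__main__":
    links = {
        "https://scholar.google.com/scholar?cluster=956029896799880103&hl=en&as_sdt=0,5",
        "https://www.sciencedirect.com/science/article/pii/S0378429009002512",
    }
    print(pick_external_link(links))
```

In the loop above this would replace the next(iter(...)) call, for example url = pick_external_link(result.absolute_links), with a check for None before appending.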

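Basically, you can do the same thing with the Google Scholar API from SerpApi. The difference is that you don't have to render the page or use browser automation such as selenium to get data from Google Scholar. You get instant JSON output, which is faster than selenium or requests-html, without having to figure out how to bypass Google's blocking.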

It is a paid API with a trial of 5,000 searches. A completely free trial is currently in development.

Code to integrate:

from serpapi import GoogleSearch
import json

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "google_scholar",
  "q": "vicia faba",
  "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

results_data = []

for result in results['organic_results']:
    title = result['title']
    url = result['link']

    results_data.append({
        'title': title,
        'url': url,
    })
    
print(json.dumps(results_data, indent = 2, ensure_ascii = False))

Partial output:

[
  {
    "title": "Faba bean (Vicia faba L.)",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429097000257"
  },
  {
    "title": "Nutritional value of faba bean (Vicia faba L.) seeds for feed and food",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429009002512"
  },
]
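The loop above indexes results['organic_results'] and result['link'] directly, which raises KeyError if a key is missing from the response. A slightly more defensive variant could look like the sketch below. It only assumes the response shape visible in the code above (a dict with an organic_results list whose items carry title and link); the function name extract_title_url and the decision to skip incomplete items are choices made for this example.

```python
def extract_title_url(payload):
    """Collect title/url pairs from a dict shaped like search.get_dict() output."""
    extracted = []
    # .get() with a default avoids KeyError when the list is missing.
    for result in payload.get("organic_results", []):
        title = result.get("title")
        url = result.get("link")
        if title and url:
            extracted.append({"title": title, "url": url})
    return extracted


if __name__ == "__main__":
    sample = {
        "organic_results": [
            {
                "title": "Faba bean (Vicia faba L.)",
                "link": "https://www.sciencedirect.com/science/article/pii/S0378429097000257",
            },
            {"title": "An item without a link"},
        ]
    }
    print(extract_title_url(sample))
```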

Disclaimer: I work for SerpApi.

【Comments】:

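Dear Dimitry Zub, thank you very much for your answer. Actually, I gave up on Google Scholar a long time ago. I now prefer to use the fully open PubMed API to retrieve scientific literature (Google and other such companies seem all too prone to monopolizing). Cheers!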