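andrew_reece's answer to this question doesn't work: even though an h3 tag with the correct class is present in the page source, the code still raises an error, for example because you got a CAPTCHA after Google detected your script as automated. Print the response to see the message.
After sending too many requests, I got this: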
The block will expire shortly after those requests stop.
Sometimes you may be asked to solve the CAPTCHA
if you are using advanced terms that robots are known to use,
or sending requests very quickly.
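To make the "print the response" advice repeatable, here is a minimal sketch of a check for a block page. The helper name `looks_blocked`, the marker strings, and the HTTP 429 check are my assumptions for illustration, not an official list from Google; adjust them to whatever message you actually see.

```python
# Hypothetical helper: the markers below are assumptions based on the
# message quoted above, not an exhaustive or official list.
BLOCK_MARKERS = ("captcha", "unusual traffic", "block will expire")


def looks_blocked(status_code: int, html: str) -> bool:
    """Return True if a response looks like a block or CAPTCHA page."""
    if status_code == 429:  # HTTP "Too Many Requests"
        return True
    text = html.lower()
    return any(marker in text for marker in BLOCK_MARKERS)
```

With requests you would call it as `looks_blocked(response.status_code, response.text)` and print `response.text` when it returns True.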
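The first thing you can do is add proxies to your request:
# https://docs.python-requests.org/en/master/user/advanced/#proxies
import os
import requests
proxies = {
    'http': os.getenv('HTTP_PROXY'),  # Or just type your proxy here without os.getenv()
    'https': os.getenv('HTTPS_PROXY'),  # requests picks the proxy by URL scheme, and Google Scholar is served over https
}
The request then looks like this:
# headers is assumed to be the request-headers dict you already use (for example, one that sets a User-Agent)
html = requests.get('google scholar link', headers=headers, proxies=proxies).text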
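Since the block message above mentions sending requests very quickly, spacing requests out may help alongside a proxy. This is only a sketch: `fetch_politely` is a hypothetical helper, and the 2 to 5 second range is an assumption of mine, not a limit documented by Google.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    fetch is any callable that takes a URL and returns the page;
    sleep is injectable so the pauses can be skipped in tests.
    """
    pages = []
    for index, url in enumerate(urls):
        if index > 0:  # no need to wait before the first request
            sleep(random.uniform(min_delay, max_delay))
        pages.append(fetch(url))
    return pages
```

For real use, pass something like `lambda url: requests.get(url, headers=headers, proxies=proxies).text` as `fetch`.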
Or you can make it work with requests-HTML, selenium, or pyppeteer, without proxies, simply by rendering the page.
Code:
# If you get an empty list, you most likely got a CAPTCHA.
from requests_html import HTMLSession
import json
session = HTMLSession()
response = session.get('https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=vicia+faba&btnG=')
# https://requests-html.kennethreitz.org/#javascript-support
response.html.render()
results = []
# Container where the data we need is located
for result in response.html.find('.gs_ri'):
    title = result.find('.gs_rt', first=True).text
    # print(title)
    # absolute_links is a set of URLs; next(iter(...)) takes a single element from it
    # (sets are unordered, so which link you get is not guaranteed)
    url = next(iter(result.absolute_links))
    # print(url)
    results.append({
        'title': title,
        'url': url,
    })
print(json.dumps(results, indent=2, ensure_ascii=False))
Partial output:
[
  {
    "title": "Faba bean (Vicia faba L.)",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429097000257"
  },
  {
    "title": "Nutritional value of faba bean (Vicia faba L.) seeds for feed and food",
    "url": "https://scholar.google.com/scholar?cluster=956029896799880103&hl=en&as_sdt=0,5"
  }
]
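Note that the second URL in this output points at a Scholar cluster page rather than the article itself. That is consistent with `absolute_links` being a set: `next(iter(...))` returns whichever element happens to come first. A minimal sketch of a more repeatable pick follows; `pick_result_url` is a hypothetical helper, and preferring links that leave scholar.google.com is my assumption about what you want, not something the library does for you.

```python
from urllib.parse import urlparse


def pick_result_url(links, own_host="scholar.google.com"):
    """Pick one URL from a set of links in a repeatable way.

    Prefers links that leave Google Scholar, falls back to any link,
    and returns None for an empty set.
    """
    ordered = sorted(links)  # sets have no stable order, so sort first
    external = [u for u in ordered if urlparse(u).hostname != own_host]
    candidates = external or ordered
    return candidates[0] if candidates else None
```

In the loop, `url = pick_result_url(result.absolute_links)` would replace the `next(iter(...))` line. If a result carries several external links, this still picks only the alphabetically first one.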
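Alternatively, you can do the same thing with the Google Scholar API from SerpApi. You don't have to render the page or use browser automation such as selenium to get data from Google Scholar. You get JSON output right away, faster than with selenium or requests-html, and you don't have to figure out how to bypass blocks from Google.
It's a paid API with a trial of 5,000 searches. A completely free trial is currently under development.
Code to integrate: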
from serpapi import GoogleSearch
import json
params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google_scholar",
    "q": "vicia faba",
    "hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
results_data = []
for result in results['organic_results']:
    title = result['title']
    url = result['link']
    results_data.append({
        'title': title,
        'url': url,
    })
print(json.dumps(results_data, indent=2, ensure_ascii=False))
Partial output:
[
  {
    "title": "Faba bean (Vicia faba L.)",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429097000257"
  },
  {
    "title": "Nutritional value of faba bean (Vicia faba L.) seeds for feed and food",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429009002512"
  }
]
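The loop above indexes `results['organic_results']` and `result['link']` directly, so it raises `KeyError` if either key is missing. A more defensive sketch is below. `extract_results` is a hypothetical helper, and the assumptions that a failed response carries an `"error"` key and that some results lack a `"link"` should be checked against the API documentation.

```python
def extract_results(response: dict) -> list:
    """Turn a SerpApi-style response dict into a list of title/url dicts."""
    if "error" in response:  # assumed shape of a failed response
        raise RuntimeError(response["error"])
    return [
        # .get() yields None instead of raising when a field is absent
        {"title": item.get("title"), "url": item.get("link")}
        for item in response.get("organic_results", [])
    ]
```

You would call it as `results_data = extract_results(search.get_dict())` and dump the list with `json.dumps` as before.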
Disclaimer: I work for SerpApi.