You may be sending too many requests, or Google has detected your script as automated.
The first thing you can try is adding proxies to your request:
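# https://docs.python-requests.org/en/master/user/advanced/#proxies
import os

proxies = {
    # Or just type your proxy URLs here without os.getenv()
    'http': os.getenv('HTTP_PROXY'),
    'https': os.getenv('HTTPS_PROXY'),  # the Scholar URL is https, so this key is the one that applies
}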
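As a minimal sketch of the proxy setup above: requests looks up the proxy by URL scheme, so an `https://` Scholar URL needs an `https` entry. The helper name `build_proxies` and the use of the `HTTPS_PROXY` variable are assumptions for illustration, not part of the original snippet.

```python
import os


def build_proxies(env=None):
    """Build a requests-style proxies dict from HTTP_PROXY / HTTPS_PROXY.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to try out without touching real environment variables.
    """
    env = os.environ if env is None else env
    proxies = {}
    for scheme, variable in (("http", "HTTP_PROXY"), ("https", "HTTPS_PROXY")):
        value = env.get(variable)
        if value:  # skip unset or empty variables instead of storing None
            proxies[scheme] = value
    return proxies
```

The result can then be passed along with the request, for example `requests.get(url, proxies=build_proxies(), timeout=10)`.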
Alternatively, you can skip proxies and render the whole HTML page with requests-html or selenium, but you can still get a CAPTCHA.
Code that works (I tested it locally):
# If you get an empty list, you most likely got a CAPTCHA from Google.
# Print the response to see what caused it.
# Note: the code below doesn't do pagination. https://requests-html.kennethreitz.org/#pagination
from requests_html import HTMLSession

session = HTMLSession()
url = 'https://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security'
response = session.get(url)

# https://requests-html.kennethreitz.org/#requests_html.HTML.render
response.html.render(sleep=1)

for author_name in response.html.find('.gs_ai_name'):
    name = author_name.text
    print(name)
Output:
Johnson Thomas
Martin Abadi
Adrian Perrig
Vern Paxson
Frans Kaashoek
Mihir Bellare
Matei Zaharia
Helen J. Wang
Zhu Han
Sushil Jajodia
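The comment in the code above says an empty result usually means a CAPTCHA. A rough way to tell the cases apart might look like the sketch below. The function name `diagnose` and the marker strings are assumptions: Google's block pages can change, so treat this as a heuristic rather than a reliable check.

```python
# Assumed markers of a block page; these are guesses and may go stale.
BLOCK_MARKERS = ("unusual traffic", "captcha", "/sorry/")


def diagnose(status_code, html_text, names):
    """Classify a scrape attempt as 'ok', 'blocked', or 'empty'.

    names: the list of author names extracted from the page.
    """
    if names:
        # Real results were parsed, so ignore any marker-like strings in scripts.
        return "ok"
    text = html_text.lower()
    if status_code == 429 or any(marker in text for marker in BLOCK_MARKERS):
        return "blocked"
    # No names and no block markers: the selector may be outdated,
    # or the query simply matched nothing.
    return "empty"
```

With requests-html this could be called as `diagnose(response.status_code, response.html.html, names)`, where `names` is the list collected from the `.gs_ai_name` elements.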
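Alternatively, you can use the Google Scholar Profiles API from SerpApi. It's a paid API with a trial of 5,000 searches. A completely free trial is currently in development.
The main difference is that you don't have to think about solving CAPTCHAs, and you avoid the slow scraping that comes from rendering pages or straining your PC with multiple browser instances, for example when using selenium.
Code to integrate: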
from serpapi import GoogleSearch

params = {
    "engine": "google_scholar_profiles",
    "hl": "en",
    "mauthors": "label:security",
    "api_key": "YOUR_API_KEY"
}

search = GoogleSearch(params)
results = search.get_dict()

for author_name in results['profiles']:
    name = author_name['name']
    print(name)
Output:
Johnson Thomas
Martin Abadi
Adrian Perrig
Vern Paxson
Frans Kaashoek
Mihir Bellare
Matei Zaharia
Helen J. Wang
Zhu Han
Sushil Jajodia
Partial JSON output:
"profiles": [
  {
    "name": "Johnson Thomas",
    "link": "https://scholar.google.com/citations?hl=en&user=eKLr0EgAAAAJ",
    "serpapi_link": "https://serpapi.com/search.json?author_id=eKLr0EgAAAAJ&engine=google_scholar_author&hl=en",
    "author_id": "eKLr0EgAAAAJ",
    "affiliations": "Professor of Computer Science, Oklahoma State University",
    "email": "Verified email at cs.okstate.edu",
    "cited_by": 150263,
    "interests": [
      {
        "title": "Security",
        "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Asecurity",
        "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:security"
      }
    ]
  }
]
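The JSON above carries more than the name. A small sketch of pulling a few fields out of each profile, using only keys visible in the sample, could look like this. The helper name `summarize_profiles` is hypothetical, and the `.get()` defaults assume that some keys may be missing from a given response.

```python
def summarize_profiles(results):
    """Extract name, author_id, cited_by, and interest titles per profile.

    results: the dict returned by the API call (e.g. search.get_dict()).
    Missing keys yield None or an empty list instead of raising KeyError.
    """
    rows = []
    for profile in results.get("profiles", []):
        interests = [item.get("title") for item in profile.get("interests", [])]
        rows.append({
            "name": profile.get("name"),
            "author_id": profile.get("author_id"),
            "cited_by": profile.get("cited_by"),
            "interests": interests,
        })
    return rows
```

Using `results.get("profiles", [])` rather than `results['profiles']` means a response without that key simply produces an empty list.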
Disclaimer: I work for SerpApi.