You're looking for the .ubHt5c CSS selector, for example:
examples = soup.select('.ubHt5c')
for example in examples:
    pass  # other code..

# or
for example in soup.select('.ubHt5c'):
    pass  # other code..

# or list comprehension
examples = [example.text for example in soup.select('.ubHt5c')]  # returns a list
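Make sure you send a user-agent. The default requests user-agent is python-requests, so Google blocks the request because it recognizes a bot rather than a "real" user visit, and you receive different HTML containing some kind of error. A user-agent fakes a real user visit by adding this information to the HTTP request headers.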
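To see this without sending anything over the network, here is a minimal sketch that builds (but does not send) two requests and compares the user-agent each one would carry. The variable names are illustrative only, and the browser string is simply the one used elsewhere in this answer.

```python
import requests

# The user-agent that requests falls back to when none is supplied.
default_ua = requests.utils.default_user_agent()

# Browser-like user-agent string (same one used elsewhere in this answer).
browser_ua = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
)

session = requests.Session()
url = 'https://www.google.com/search'

# prepare_request() merges the session defaults with the per-request headers,
# so the result shows exactly what would be sent. Nothing is sent here.
prepared_default = session.prepare_request(requests.Request('GET', url))
prepared_custom = session.prepare_request(
    requests.Request('GET', url, headers={'User-agent': browser_ua})
)

print(prepared_default.headers['User-Agent'])  # python-requests/<installed version>
print(prepared_custom.headers['User-Agent'])   # the browser-like string
```

Header names are matched case-insensitively, so 'User-agent' overrides the default 'User-Agent' entry.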
I wrote a dedicated blog post about how to reduce the chance of being blocked while web scraping search engines that covers multiple solutions.
Pass the user-agent in the request headers:
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get('YOUR_URL', headers=headers)
Code and full example in the online IDE:
import requests
from bs4 import BeautifulSoup  # the 'lxml' parser used below requires the lxml package to be installed

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

params = {
    'q': 'swagger definition',  # search query
    'gl': 'us'                  # country to search from
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# collect the text of every element matching the selector
examples = [example.text for example in soup.select('.ubHt5c')]
print(examples)
# ['"he swaggered along the corridor"', '"they strolled around the camp with an exaggerated swagger"']
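If the request is blocked, or if the class name changes, the selector simply matches nothing and you get an empty list rather than an exception. A minimal offline sketch of that behaviour follows; the `extract_examples` helper and both HTML strings are made up for illustration and are not Google's actual markup. It uses the built-in 'html.parser' so it runs without lxml.

```python
from bs4 import BeautifulSoup


def extract_examples(html, selector='.ubHt5c'):
    """Return the stripped text of every element matching the CSS selector."""
    soup = BeautifulSoup(html, 'html.parser')
    return [node.get_text(strip=True) for node in soup.select(selector)]


# Hypothetical markup that contains the expected class.
sample_html = '''
<div>
  <div class="ubHt5c">"he swaggered along the corridor"</div>
  <div class="other">not an example</div>
  <div class="ubHt5c">"they strolled around the camp with an exaggerated swagger"</div>
</div>
'''

# Hypothetical stand-in for the different HTML returned to a blocked request.
blocked_html = '<html><body><p>Unusual traffic detected</p></body></html>'

found = extract_examples(sample_html)
missing = extract_examples(blocked_html)

print(found)
print(missing)  # empty list: a hint that the request was blocked or the selector is stale
```

Checking for an empty result before using it is a cheap way to notice that something upstream went wrong.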
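Alternatively, you can use the Google Direct Answer Box API from SerpApi to achieve the same thing. It's a paid API with a free plan.
The difference in your case is that you don't have to figure out how to make things work and then maintain it over time. Instead, you only need to iterate over structured JSON and quickly get the data you want.
Code to integrate: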
from serpapi import GoogleSearch

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "swagger definition",  # search query
    "gl": "us",                 # country to search from
    "hl": "en"                  # interface language
}

search = GoogleSearch(params)
results = search.get_dict()

examples = results['answer_box']['examples']
print(examples)
# ['"he swaggered along the corridor"', '"they strolled around the camp with an exaggerated swagger"']
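Indexing `results['answer_box']['examples']` directly raises a `KeyError` if either key is absent, which I'd assume can happen for queries that produce no answer box. A small sketch of a more defensive lookup is below; the `get_examples` helper and the sample dictionaries are hypothetical stand-ins, not real API responses.

```python
def get_examples(results):
    """Return the answer-box examples, or an empty list if any level is missing."""
    answer_box = results.get('answer_box') or {}
    return answer_box.get('examples') or []


# Hypothetical response shapes, for illustration only.
with_box = {'answer_box': {'examples': ['"he swaggered along the corridor"']}}
without_examples = {'answer_box': {}}
without_box = {}

print(get_examples(with_box))
print(get_examples(without_examples))
print(get_examples(without_box))
```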
Disclaimer: I work for SerpApi.