- A default Google search URL doesn't contain the # symbol. Instead, it should have ? and the /search pathname:
---> https://google.com/#q= (wrong)
---> https://www.google.com/search?q=cake (correct)
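To make the difference concrete, here is a minimal sketch that builds the /search?q= form with the standard library. The helper name build_search_url and the constant GOOGLE_SEARCH_ENDPOINT are made up for this illustration; requests does the same encoding for you when you pass params=.

```python
from urllib.parse import urlencode

# Endpoint taken from the corrected URL above; the constant name is hypothetical.
GOOGLE_SEARCH_ENDPOINT = "https://www.google.com/search"


def build_search_url(query, **extra_params):
    """Return a /search URL with the parameters placed after '?' (no '#')."""
    params = {"q": query, **extra_params}
    # urlencode() escapes each value and joins the pairs with '&'.
    return f"{GOOGLE_SEARCH_ENDPOINT}?{urlencode(params)}"


print(build_search_url("cake"))
# https://www.google.com/search?q=cake
print(build_search_url("tesla model 3", hl="en", gl="us"))
# https://www.google.com/search?q=tesla+model+3&hl=en&gl=us
```

Anything after # is a URL fragment, which the client does not send to the server, so a query placed there never reaches Google.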
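- Make sure you pass a user-agent in the HTTP request headers, because the default user-agent of the requests library is python-requests, so a site can recognize it and block the script. Check robots.txt for more information.
This may be why you got empty results. Check what your user-agent is, and look at a list of user-agents.
Pass the user-agent in the request headers: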
import requests

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

requests.get('YOUR_URL', headers=headers)
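The robots.txt check mentioned above can also be done in code. The sketch below uses urllib.robotparser from the standard library against a hand-written sample file, so it runs without a network connection. The sample rules and the helper name is_allowed are assumptions for illustration, not Google's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hand-written sample rules for illustration only, NOT a real site's robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "python-requests", "https://example.com/search?q=cake"))
# False
print(is_allowed(SAMPLE_ROBOTS_TXT, "python-requests", "https://example.com/about"))
# True
```

To check a live site instead, RobotFileParser also has set_url() and read(), which download the file before you call can_fetch().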
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, json, lxml  # lxml is imported only so a missing parser fails early

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

params = {
    'q': 'tesla',  # query
    'gl': 'us',    # country to search from
    'hl': 'en',    # language
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

data = []

# the CSS class names below match Google's markup at the time of writing and may change
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']

    # sometimes there's no description and we need to handle this exception
    try:
        snippet = result.select_one('#rso .lyLwlc').text
    except AttributeError:  # select_one() returned None
        snippet = None

    data.append({
        'title': title,
        'link': link,
        'snippet': snippet,
    })

print(json.dumps(data, indent=2, ensure_ascii=False))
-------------
'''
[
  {
    "title": "Tesla: Electric Cars, Solar & Clean Energy",
    "link": "https://www.tesla.com/",
    "snippet": "Tesla is accelerating the world's transition to sustainable energy with electric cars, solar and integrated renewable energy solutions for homes and ..."
  },
  {
    "title": "Tesla, Inc. - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Tesla,_Inc.",
    "snippet": "Tesla, Inc. is an American electric vehicle and clean energy company based in Palo Alto, California, United States. Tesla designs and manufactures electric ..."
  },
  {
    "title": "Nikola Tesla - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Nikola_Tesla",
    "snippet": "Nikola Tesla was a Serbian-American inventor, electrical engineer, mechanical engineer, and futurist best known for his contributions to the design of the ..."
  }
]
'''
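Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan that is meant only for testing the API.
The difference in your case is that you don't have to figure out why the output is empty and what causes it, bypass blocks from Google or other search engines, and then maintain the parser over time. Instead, you only need to grab the data you want from the structured JSON.
Code to integrate: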
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "tesla",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),  # API key is read from the environment
}

search = GoogleSearch(params)
results = search.get_dict()

# each organic result is a dict with the fields printed below
for result in results["organic_results"]:
    print(f"Title: {result['title']}\nSummary: {result['snippet']}\nLink: {result['link']}\n")
----------
'''
Title: Tesla: Electric Cars, Solar & Clean Energy
Summary: Tesla is accelerating the world's transition to sustainable energy with electric cars, solar and integrated renewable energy solutions for homes and ...
Link: https://www.tesla.com/
Title: Tesla, Inc. - Wikipedia
Summary: Tesla, Inc. is an American electric vehicle and clean energy company based in Palo Alto, California, United States. Tesla designs and manufactures electric ...
Link: https://en.wikipedia.org/wiki/Tesla,_Inc.
'''
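Since the response is a plain dictionary, picking out fields needs no parser. The sketch below shows one defensive way to do it, assuming a result may occasionally lack a snippet key. The sample_response dictionary is a hand-written stand-in shaped like the output above (not real API output), and summarize_organic_results is a hypothetical helper name.

```python
def summarize_organic_results(results):
    """Collect title, link and snippet from each organic result.

    .get() returns None for a missing key instead of raising KeyError.
    """
    rows = []
    for item in results.get("organic_results", []):
        rows.append({
            "title": item.get("title"),
            "link": item.get("link"),
            "snippet": item.get("snippet"),
        })
    return rows


# Hand-written stand-in for a response; the second entry deliberately
# has no "snippet" key to show the fallback.
sample_response = {
    "organic_results": [
        {
            "title": "Tesla: Electric Cars, Solar & Clean Energy",
            "link": "https://www.tesla.com/",
            "snippet": "Tesla is accelerating the world's transition to sustainable energy ...",
        },
        {
            "title": "Tesla, Inc. - Wikipedia",
            "link": "https://en.wikipedia.org/wiki/Tesla,_Inc.",
        },
    ]
}

for row in summarize_organic_results(sample_response):
    print(row["title"], "|", row["snippet"])
```

The same function can be called on the dictionary returned by search.get_dict() in the code above.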
Disclaimer: I work for SerpApi.