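To get the summary, you can use the select_one() method provided by bs4, passing it a CSS selector. You can use the SelectorGadget Chrome extension, or any other tool, to find the selector quickly.
Make sure you send a user-agent header. Otherwise Google may block your request, because the default user-agent will be python-requests (if you're using the requests library).
List of user-agents to fake a user visit.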
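To illustrate the user-agent point, here is a minimal sketch that prints the default value requests would send and then sets a browser-like one on a session. It makes no network request. The names default_ua, session and custom_ua are only illustrative, and the user-agent string is simply the one used in the example in this answer.

```python
import requests

# The User-Agent that requests sends when you don't set one yourself.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # starts with "python-requests/" followed by the installed version

# Set a browser-like User-Agent once on a session so every request reuses it.
session = requests.Session()
session.headers.update({
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
})
custom_ua = session.headers["User-Agent"]
print(custom_ua)
```

Passing headers=... to each requests.get() call, as in the full example, works just as well; a session only saves repeating it.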
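From there you can scrape every other part you want with the select_one() method. Keep in mind that you can scrape information from the Knowledge Graph only if Google provides one for the query. You can write if or try-except statements to handle the case where it is missing.
Code and full example: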
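from bs4 import BeautifulSoup
import requests

# A browser-like User-Agent; the default python-requests one may get blocked
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
# Let requests URL-encode the query instead of putting spaces in the URL
params = {"q": "who is donald trump"}
html = requests.get("https://www.google.com/search", params=params, headers=headers).text
# The "lxml" parser needs the lxml package installed; importing it is not required
soup = BeautifulSoup(html, "lxml")
# Class names in Google's markup change over time, so this selector may need updating
summary = soup.select_one(".Uo8X3b+ span").text
print(summary)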
Output:
Donald John Trump is an American media personality and businessman who served as the 45th president of the United States from 2017 to 2021.
Born and raised in Queens, New York City, Trump attended Fordham University and the University of Pennsylvania, graduating with a bachelor's degree in 1968.
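As noted earlier, the Knowledge Graph is not always present, and select_one() returns None when nothing matches, so calling .text on the result would raise AttributeError. A minimal sketch of the if-based handling, run on made-up HTML rather than a real Google response: extract_summary and both sample strings are hypothetical, and the selector is the one from the example above, which may no longer match Google's current markup.

```python
from bs4 import BeautifulSoup


def extract_summary(html, selector=".Uo8X3b+ span"):
    """Return the text of the first element matching selector, or None."""
    # html.parser ships with Python, so this sketch does not need lxml.
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector)  # None when nothing matches
    if node is None:
        return None
    return node.get_text(strip=True)


# Made-up snippets standing in for pages with and without the summary block.
sample_with = '<div><h3 class="Uo8X3b">Description</h3><span>Example summary.</span></div>'
sample_without = "<div><p>No knowledge graph here.</p></div>"

print(extract_summary(sample_with))     # Example summary.
print(extract_summary(sample_without))  # None
```

The try-except variant would wrap soup.select_one(...).text and catch AttributeError instead; the explicit None check just makes the missing case easier to see.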
Alternatively, you can use the Google Knowledge Graph API from SerpApi. It's a paid API with a free trial. Check out the Playground for more information.
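import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "who is donald trump",
    # Read the SerpApi key from the API_KEY environment variable
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
# The knowledge_graph block is only present when Google returns one for the query
summary = results["knowledge_graph"]["description"]
print(summary)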
Output:
Donald John Trump is an American media personality and businessman who served as the 45th president of the United States from 2017 to 2021.
Born and raised in Queens, New York City, Trump attended Fordham University and the University of Pennsylvania, graduating with a bachelor's degree in 1968.
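The same caveat applies here: if the query has no Knowledge Graph, indexing results["knowledge_graph"] raises KeyError. A small sketch of a defensive lookup, assuming only that the response is a plain dict, as get_dict() returns. The get_description helper and both sample dicts are made up for illustration and are not real API output.

```python
def get_description(results):
    """Return the Knowledge Graph description from a results dict, or None."""
    # Fall back to an empty dict when the key is missing or its value is None.
    knowledge_graph = results.get("knowledge_graph") or {}
    return knowledge_graph.get("description")


# Made-up dicts standing in for responses with and without a Knowledge Graph.
with_graph = {"knowledge_graph": {"description": "Example description."}}
without_graph = {"organic_results": []}

print(get_description(with_graph))     # Example description.
print(get_description(without_graph))  # None
```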
Disclaimer: I work for SerpApi.