Python Requests failed to provide the entire response

【Question title】: Python Requests failed to provide entire response
【Posted】: 2019-06-14 17:56:15
【Question description】:

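I am currently learning web scraping. Today I tried to scrape a google.com search. When I make a GET request with the Python requests library, it does not give me the entire response.

For example, if I request the URL https://www.google.com/search?q=tea+meaning to get the meaning of the word "tea", the response contains only the noun content and not the verb content.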

from bs4 import BeautifulSoup as bs
import requests as req

headers_Get = {
    'Host': 'www.google.com',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/74.0.3729.169 Chrome/74.0.3729.169 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}

url = "https://www.google.com/search?q=tea+meaning"
response = req.get(url, headers=headers_Get)

data = response.text
soup = bs(data, "html.parser")

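The problem is with this soup object. It does not contain the verb content. Why is that?

Thanks.

【Question comments】:

What is the verb content?
It does not contain the verb content? What do you mean by verb content?
print(soup.prettify())
This is the entire response.
Tea is not a verb, so...?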

Tags: python web-scraping beautifulsoup python-requests

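【Solution 1】:

The problem is that Google does not send the search results back as a single page. Much of what you see as search results in the browser is loaded by separate AJAX requests. You may get some partial data in the initial request, but it will not necessarily match what you see in a regular browser.

To see what Beautiful Soup and Requests will see, try opening the link in a browser with JavaScript turned off.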
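
A rough way to run the same check without a browser is to pull the visible text out of the HTML that Requests returned and test whether the phrase you expect is in it. The sketch below uses only the standard library. `VisibleTextParser`, `visible_text`, and the sample HTML are hypothetical, made up for illustration; the markup is not Google's real markup.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a reader would see with JavaScript turned off."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up HTML standing in for response.text: the noun entry is in the
# static markup, the verb entry would only be added later by a script.
sample_html = """
<html><body>
<div>noun: a hot drink</div>
<script>loadLater("verb: drink tea");</script>
</body></html>
"""

text = visible_text(sample_html)
print("noun" in text)  # True
print("verb" in text)  # False
```

If a phrase is missing from this text, no choice of Beautiful Soup selector will find it, because it was never part of the initial response.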

【Discussion】:

【Solution 2】:

You should select the <div> you want to print. As written, you are getting the whole page.

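import requests
from bs4 import BeautifulSoup

url = "https://www.google.com/search?q=tea+meaning"
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36'}
page = requests.get(url, headers=header)

soup = BeautifulSoup(page.content, 'html.parser')
# Select only the definition block instead of printing the whole page.
block = soup.select_one('div.vmod')
# select_one() returns None when nothing matches, e.g. if the page markup has changed.
print(block.get_text() if block is not None else "div.vmod not found")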

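This code prints everything, including the verb. Also, if what you want is word meanings, https://developer.oxforddictionaries.com/ has a good API; try using it.

【Discussion】:

【Solution 3】:

It may be that, for some reason, this meaning is not shown on the US version of the site. You can make it work by setting the Google Search country parameter gl to the UK.

You can pass query params like this: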

params = {
  "q": "tea meaning",  # query
  "gl": "uk"           # country to make search from
}
requests.get("YOUR_URL", params=params)
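
To see what such a params dict turns into on the wire, here is a minimal sketch that builds the query string by hand with the standard library. The names `base_url` and `full_url` are illustrative only; Requests does its own encoding when you pass `params=`, so this is only a way to inspect the resulting URL, not a required step.

```python
from urllib.parse import urlencode

params = {
    "q": "tea meaning",  # query
    "gl": "uk",          # country to make the search from
}

# The search endpoint used elsewhere in this thread.
base_url = "https://www.google.com/search"

# urlencode() joins the pairs with "&" and encodes the space as "+".
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)
# https://www.google.com/search?q=tea+meaning&gl=uk
```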

If you only want a single definition, use select_one():

word_meaning = soup.select_one(".sY7ric .sY7ric span").text
print(word_meaning)
# a hot drink made by infusing the dried crushed leaves of the tea plant in boiling water.

Full code:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "tea meaning",
  "gl": "uk"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

word_meaning = soup.select_one(".sY7ric .sY7ric span").text
print(word_meaning)

# a hot drink made by infusing the dried crushed leaves of the tea plant in boiling water.

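Alternatively, you can use the Google Direct Answer Box API from SerpApi. It is a paid API with a free plan.

The difference in your case is that you do not have to work out the right selectors or maintain the parser over time, because that is already done for the end user. All you need to do is pick the data you want out of a structured JSON string.

Code to integrate: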

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "tea meaning",
    "gl": "uk",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

word_meaning = results['answer_box']['definitions']
print(word_meaning)

---------
'''
# list of definitions:
[
'a hot drink made by infusing the dried crushed leaves of the tea plant in boiling water.', 
'the dried leaves used to make tea.', 
'a hot drink made from the infused leaves, fruits, or flowers of other plants.',
'the evergreen shrub or small tree that produces tea leaves, native to South and eastern Asia and grown as a major cash crop.'
]
'''
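
Indexing with `results['answer_box']['definitions']` raises a KeyError whenever either key is absent. As a minimal sketch, assuming an answer box may not come back for every query, the lookup can be made tolerant of missing keys. `extract_definitions` and `sample_results` are hypothetical names; the sample dict only mimics the shape of the output above and is not a real API response.

```python
def extract_definitions(results):
    """Return the list of definitions, or an empty list if a key is missing."""
    answer_box = results.get("answer_box") or {}
    definitions = answer_box.get("definitions") or []
    return list(definitions)


# Hypothetical sample shaped like the output above, not a real API response.
sample_results = {
    "answer_box": {
        "definitions": [
            "a hot drink made by infusing the dried crushed leaves of the tea plant in boiling water.",
            "the dried leaves used to make tea.",
        ]
    }
}

print(extract_definitions(sample_results)[0])
print(extract_definitions({}))  # no answer box in the results
```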

Disclaimer: I work for SerpApi.

【Discussion】: