Cannot scrape an element in Google search although it can be seen in view source

【Question title】: Cannot scrape an element in Google search although it can be seen in view source
【Posted】: 2017-04-09 09:09:00
【Question description】:

I am trying to scrape the definition of a word from Google search:

https://www.google.co.in/search?q=define%20subtle#cns=1

All the meanings and examples are visible when I view the page source, but I still cannot scrape them. For example,

<div class="vk_gy">"his language expresses rich and subtle meanings"</div>

can be seen in the source, yet soup.find("div", class_='vk_gy') returns None.

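【Question comments】:

Are you sure you are looking at the page source and not the DOM generated after JavaScript has run? Google uses a lot of JavaScript, and if they suspect you are scraping, they will block you quickly.
Yes, I just right-clicked the page and chose View Source.
The DOM may also depend on the User-Agent string you send to Google. Have you printed the DOM in your script and made sure it looks as expected?
I checked the DOM in the DOM Inspector add-on in Firefox, and it is there.
But you have not confirmed that you get the same DOM in your Python script, right? You should print it there and double-check.

Tags: python web-scraping beautifulsoup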

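【Solution 1】:

Make sure the complete HTML string is loaded into Beautiful Soup. How are you fetching the HTML? Google does not like having its pages scraped. If you can get the fully loaded HTML into Python, you will find that your command works. Here is my output: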

>>> print(soup.find("div", class_='vk_gy').prettify())
<div class="xpdxpnd vk_gy" data-mh="-1">
 <span>
  adjective:
  <b>
   subtle
  </b>
 </span>
 <span>
  ; comparative adjective:
  <b>
   subtler
  </b>
 </span>
 <span>
  ; superlative adjective:
  <b>
   subtlest
  </b>
 </span>
</div>
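
The comments above suggest printing the HTML that the script actually receives. A minimal sketch of that check follows. `summarize_html` is a hypothetical helper name (it is not part of requests or Beautiful Soup), and it only does a plain substring search on the raw response text, before any parsing is attempted.

```python
def summarize_html(html: str, marker: str, context: int = 40) -> dict:
    """Report whether `marker` occurs in the raw HTML and show the text around it.

    Hypothetical diagnostic helper: a plain substring search shows what the
    server actually sent, independent of any HTML parser.
    """
    position = html.find(marker)
    found = position != -1
    snippet = ""
    if found:
        start = max(0, position - context)
        end = min(len(html), position + len(marker) + context)
        snippet = html[start:end]
    return {"length": len(html), "found": found, "snippet": snippet}


# Usage with the text of your own response (not executed here):
#   response = requests.get(url, headers=headers)
#   print(summarize_html(response.text, "vk_gy"))

# Small offline samples, made up for illustration
sample = '<div class="xpdxpnd vk_gy" data-mh="-1"><span>adjective: <b>subtle</b></span></div>'
report = summarize_html(sample, "vk_gy")
print(report["found"])   # True

missing = summarize_html("<html><body>blocked</body></html>", "vk_gy")
print(missing["found"])  # False
```

If `found` is False for the real response, the markup never reached the script, so the problem lies in the request (headers, blocking, or a different page variant) rather than in the `soup.find` call.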

【Comments】:

【Solution 2】:

You are looking for the .ubHt5c CSS selector, for example:

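# iterate over the matched elements
examples = soup.select('.ubHt5c')
for example in examples:
    print(example.text)  # replace with your own processing

# or, without the intermediate variable
for example in soup.select('.ubHt5c'):
    print(example.text)  # replace with your own processing

# or with a list comprehension
examples = [example.text for example in soup.select('.ubHt5c')]  # returns a list of strings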

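Make sure you send a user-agent. The default requests user-agent is python-requests, so Google can tell that the request comes from a bot rather than a "real" user and blocks it; you then receive different HTML containing some sort of error. Setting a user-agent in the HTTP request headers makes the request look like a visit from a regular browser.

I wrote a dedicated blog post on how to reduce the chance of being blocked while web scraping search engines, which covers multiple solutions.

Pass the user-agent in the request headers: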

import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get('YOUR_URL', headers=headers)

Code and full example in the online IDE:

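import requests
import lxml  # not used directly; imported to confirm the 'lxml' parser is installed
from bs4 import BeautifulSoup

# browser-like user-agent so the request is not sent as python-requests
headers = {
  'User-agent':
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

# query parameters: the search query and the country to search from
params = {
  'q': 'swagger definition',
  'gl': 'us'
}

response = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(response.text, 'lxml')

# collect the text of every usage example matched by the selector
examples = [example.text for example in soup.select('.ubHt5c')]
print(examples)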

# ['"he swaggered along the corridor"', '"they strolled around the camp with an exaggerated swagger"']

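Alternatively, you can use the Google Direct Answer Box API from SerpApi to achieve the same thing. It is a paid API with a free plan.

The difference in your case is that you do not have to figure out how to make things work and then maintain the scraper over time; instead, you only need to iterate over structured JSON and quickly get the data you want.

Code to integrate: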

from serpapi import GoogleSearch

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "google",
  "q": "swagger definition",
  "gl": "us",
  "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

examples = results['answer_box']['examples']
print(examples)

# ['"he swaggered along the corridor"', '"they strolled around the camp with an exaggerated swagger"']
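
Indexing with `results['answer_box']['examples']` raises `KeyError` whenever either key is missing from the dictionary. A defensive variant is sketched below. It assumes, which the answer does not confirm, that a response may lack an `answer_box` for some queries. `get_examples` is a hypothetical helper name, and the sample dictionaries are made up for illustration rather than real API output.

```python
def get_examples(results: dict) -> list:
    """Return the answer-box examples, or an empty list if they are absent."""
    # `or {}` / `or []` also cover keys that are present but set to None
    answer_box = results.get("answer_box") or {}
    return answer_box.get("examples") or []


# Made-up sample dictionaries, not real API responses
with_box = {"answer_box": {"examples": ['"he swaggered along the corridor"']}}
empty_box = {"answer_box": {}}
without_box = {}

print(get_examples(with_box))     # ['"he swaggered along the corridor"']
print(get_examples(empty_box))    # []
print(get_examples(without_box))  # []
```

With the code above, this would be called as `get_examples(search.get_dict())`.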

Disclaimer: I work for SerpApi.

【Comments】: