How to scrape Google Maps using Python

【Question title】: How to scrape Google Maps using Python
【Posted】: 2018-06-09 21:27:38
【Question】:

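I am trying to use Python to scrape the number of reviews of a place from Google Maps. For example, the restaurant Pike's Landing (see the Google Maps URL below) has 162 reviews. I want to extract this number in Python.

URL: https://www.google.com/maps?cid=15423079754231040967

I am not very familiar with HTML, but based on some basic examples from the internet I wrote the following code. After running it I get a blank variable. I would appreciate it if you could tell me what I am doing wrong.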

from urllib.request import urlopen
from bs4 import BeautifulSoup

quote_page ='https://www.google.com/maps?cid=15423079754231040967'
page = urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
price_box = soup.find_all('button',attrs={'class':'widget-pane-link'})
print(price_box.text)

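【Comments】:

Scraping full map data is really hard. Why not try the API?
I am not trying to scrape the whole map, just one specific number in the leftmost pane of the map. Also, as of now, the Google Maps API does not return the number of reviews.
The number may be added by JavaScript, and urllib + BeautifulSoup cannot run JavaScript. You can use Selenium to control a web browser that loads the page and runs the JavaScript. Or you can try to find this information in the JavaScript code itself, either inline in the HTML or in external *.js files. JavaScript can also use AJAX/XHR to load data from a different URL, and you can try to find that URL with DevTools in Chrome/Firefox. XHR mostly returns data as a JSON string, which you can easily convert to a Python dictionary with the json module.
By the way: Google uses JavaScript to add elements to the page, but if Google detects that the client does not use JavaScript, it may send a page that does not need JavaScript, with the elements mostly in different tags and classes. So you can turn off JavaScript in your browser and reload the map to see what BeautifulSoup gets from Google. Or you can save the data from urlopen() to a file and open that file in a web browser or text editor.
I am not very familiar with Selenium or JavaScript, but I can certainly look into them. Would you say I can still scrape Google Maps with the simple approach I used? I was hoping a minor change to the code snippet posted above would achieve my goal.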
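The two debugging steps suggested in the comments above can be sketched with the standard library alone. This is a minimal sketch, not a tested recipe for Google Maps: the helper names `save_raw_html` and `parse_xhr_json`, the User-Agent header, the sample payload, and the `)]}'` prefix that some Google endpoints put in front of JSON are all assumptions made for illustration.

```python
import json
from urllib.request import Request, urlopen

# Assumption: some Google endpoints prepend this guard string to JSON bodies.
XSSI_PREFIX = ")]}'"


def save_raw_html(url, path="raw_page.html"):
    """Save the HTML exactly as the server sends it, before any JavaScript runs."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        raw = response.read()
    with open(path, "wb") as handle:
        handle.write(raw)
    return path


def parse_xhr_json(text):
    """Convert the body of an XHR response into Python objects."""
    text = text.lstrip()
    if text.startswith(XSSI_PREFIX):
        text = text[len(XSSI_PREFIX):]
    return json.loads(text)


# Hypothetical payload, only to show the conversion step.
sample = XSSI_PREFIX + '\n{"name": "Example Place", "reviews": 162}'
data = parse_xhr_json(sample)
print(data["reviews"])
```

Opening the file written by `save_raw_html` in a text editor shows whether the element searched for with BeautifulSoup is present in the static HTML at all.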

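Tags: python html web-scraping beautifulsoup scrapy

【Answer 1】:

This is hard to do in pure Python without an API. Here is what I ended up with (note that I added &hl=en to the end of the URL to get results in English rather than in my language):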

import re
import requests
from ast import literal_eval

urls = [
'https://www.google.com/maps?cid=15423079754231040967&hl=en',
'https://www.google.com/maps?cid=16168151796978303235&hl=en']

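for url in urls:
    # Find the escaped JavaScript arrays that hold a URL and an "N reviews" label.
    for g in re.findall(r'\[\\"http.*?\d+ reviews?.*?]', requests.get(url).text):
        # Turn the escaped array text into a Python list.
        data = literal_eval(g.replace('null', 'None').replace('\\"', '"'))
        # Decode the \uXXXX escapes left in the URL.
        print(bytes(data[0], 'utf-8').decode('unicode_escape'))
        print(data[1])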

Prints:

http://www.google.com/search?q=Pike's+Landing,+4438+Airport+Way,+Fairbanks,+AK+99709,+USA&ludocid=15423079754231040967#lrd=0x51325b1733fa71bf:0xd609c9524d75cbc7,1
469 reviews
http://www.google.com/search?q=Sequoia+TreeScape,+Newmarket,+ON+L3Y+8R5,+Canada&ludocid=16168151796978303235#lrd=0x882ad2157062b6c3:0xe060d065957c4103,1
42 reviews
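To make the steps of this approach easier to follow without hitting Google, here is a small offline walk-through. It applies the same regular expression and the same `literal_eval` conversion to a hypothetical page fragment, then turns the "N reviews" label into an integer. The fragment and the helper name `extract_review_counts` are assumptions for illustration; the markup Google actually serves may differ or change.

```python
import re
from ast import literal_eval

# Hypothetical fragment shaped like the escaped array the regex targets.
sample_page = (
    r'junk [\"http://www.google.com/search?q\\u003dExample+Place\",'
    r'\"1,469 reviews\",null] more junk'
)

ARRAY_PATTERN = re.compile(r'\[\\"http.*?\d+ reviews?.*?]')


def extract_review_counts(page_text):
    """Return (url, count) pairs for every matching array in page_text."""
    found = []
    for raw in ARRAY_PATTERN.findall(page_text):
        # Turn the escaped JavaScript array into a Python list literal.
        data = literal_eval(raw.replace('null', 'None').replace('\\"', '"'))
        # Decode the \uXXXX escapes that remain in the URL.
        url = bytes(data[0], 'utf-8').decode('unicode_escape')
        # Drop every non-digit so that "1,469 reviews" becomes 1469.
        count = int(re.sub(r'\D', '', data[1]))
        found.append((url, count))
    return found


print(extract_review_counts(sample_page))
```

One caveat of this conversion: replacing every `null` with `None` would also alter a URL that happens to contain the text "null".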

【Comments】:

【Answer 2】:

You need to look at the page source and parse the window.APP_INITIALIZATION_STATE variable block; all the data you need is in there.
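A minimal sketch of that idea, assuming the value assigned to the variable is valid JSON: locate the assignment with a regular expression and let `json.JSONDecoder.raw_decode` parse exactly one value from that position. The helper name `extract_initialization_state` and the sample page source are hypothetical; the real structure is much larger, and where the review count sits inside it is not stated in this answer.

```python
import json
import re

STATE_MARKER = re.compile(r'window\.APP_INITIALIZATION_STATE\s*=\s*')


def extract_initialization_state(html):
    """Return the decoded value assigned to the variable, or None if absent."""
    match = STATE_MARKER.search(html)
    if match is None:
        return None
    # raw_decode parses one JSON value and ignores whatever follows it.
    state, _end = json.JSONDecoder().raw_decode(html, match.end())
    return state


# Hypothetical page source, only to show the extraction step.
sample_html = (
    '<script>window.APP_INITIALIZATION_STATE='
    '[[1, 2], ["Example Place", null, 469]];var other = 1;</script>'
)
state = extract_initialization_state(sample_html)
print(state[1][2])
```

Using `raw_decode` avoids having to guess where the value ends, since the script text usually continues with other statements.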

Alternatively, you can use the Google Maps Reviews API from SerpApi.

Sample JSON output:

"place_results": {
  "title": "Pike's Landing",
  "data_id": "0x51325b1733fa71bf:0xd609c9524d75cbc7",
  "reviews_link": "https://serpapi.com/search.json?engine=google_maps_reviews&hl=en&place_id=0x51325b1733fa71bf%3A0xd609c9524d75cbc7",
  "gps_coordinates": {
    "latitude": 64.8299557,
    "longitude": -147.8488774
  },
  "place_id_search": "https://serpapi.com/search.json?data=%214m5%213m4%211s0x51325b1733fa71bf%3A0xd609c9524d75cbc7%218m2%213d64.8299557%214d-147.8488774&engine=google_maps&google_domain=google.com&hl=en&type=place",
  "thumbnail": "https://lh5.googleusercontent.com/p/AF1QipNtwheOCQ97QFrUNIwKYUoAPiV81rpiW5cIiQco=w152-h86-k-no",
  "rating": 3.9,
  "reviews": 839,
  "price": "$$",
  "type": [
    "American restaurant"
  ],
  "description": "Burgers, seafood, steak & river views. Pub fare alongside steak & seafood, served in a dining room with river views & a waterfront patio.",
  "service_options": {
    "dine_in": true,
    "curbside_pickup": true,
    "delivery": false
  }
}
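Once a response shaped like the sample above has been parsed into a Python dict, the review count can be read defensively rather than with a bare key lookup. This is a sketch under assumptions: the helper name `review_count` is hypothetical, and the fallback to a `local_results` list for queries that match several places is an assumption about the response format, not something shown in this answer.

```python
def review_count(results):
    """Return the review count from a parsed response, or None if absent."""
    place = results.get("place_results")
    if isinstance(place, dict) and "reviews" in place:
        return place["reviews"]
    # Assumption: a query matching several places yields a "local_results" list.
    for candidate in results.get("local_results", []):
        if "reviews" in candidate:
            return candidate["reviews"]
    return None


# Trimmed copy of the sample output above.
sample = {"place_results": {"title": "Pike's Landing", "rating": 3.9, "reviews": 839}}
print(review_count(sample))
```

Returning None instead of raising KeyError lets the caller decide how to handle a place that was not found.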

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google_maps",
    "type": "search",
    "q": "pike's landing",
    "ll": "@40.7455096,-74.0083012,14z",
    "google_domain": "google.com",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

reviews = results["place_results"]["reviews"]

print(reviews)

Output:

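Disclaimer: I work for SerpApi.

【Comments】:

【Answer 3】:

Scraping Google Maps without a browser or a proxy will get you blocked after a few successful requests. So the main problems in scraping Google are handling cookies and reCAPTCHA.

Here is a good post where you can see an example of using Selenium in Python for the same purpose. The general idea is that you start a browser and simulate a user's actions on the site.

Another way is to use a reliable third-party service that does all the work and returns the results to you. For example, you can try Outscraper's Reviews service for free.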
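The general idea can be sketched as follows. This is not the code from the linked post. It assumes Selenium 4 with a local Chrome install and an English page (hl=en), and it does not handle consent dialogs or blocking. The helper names `find_review_count` and `fetch_review_count` are hypothetical, and the sketch searches the rendered HTML for an "N reviews" label instead of relying on a specific element class, since Google's class names are not documented.

```python
import re

REVIEWS_PATTERN = re.compile(r'(\d[\d,]*)\s+reviews?')


def find_review_count(html):
    """Return the first 'N reviews' number found in html, or None."""
    match = REVIEWS_PATTERN.search(html)
    if match is None:
        return None
    return int(match.group(1).replace(',', ''))


def fetch_review_count(url, timeout=15):
    """Load url in a real browser so JavaScript runs, then read the count."""
    # Imported here so the parsing helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait until the rendered page contains an "N reviews" label.
        WebDriverWait(driver, timeout).until(
            lambda d: find_review_count(d.page_source) is not None
        )
        return find_review_count(driver.page_source)
    finally:
        driver.quit()
```

Calling `fetch_review_count('https://www.google.com/maps?cid=15423079754231040967&hl=en')` opens a browser window; if no label appears within the timeout, `WebDriverWait` raises a `TimeoutException`.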

from outscraper import ApiClient

api_client = ApiClient(api_key='SECRET_API_KEY')

# Get reviews of the specific place by id
result = api_client.google_maps_reviews('ChIJrc9T9fpYwokRdvjYRHT8nI4', reviewsLimit=20, language='en')

# Get reviews for places found by search query
result = api_client.google_maps_reviews('Memphis Seoul brooklyn usa', reviewsLimit=20, limit=500, language='en')

# Get only new reviews during last 24 hours
from datetime import datetime, timedelta
yesterday_timestamp = int((datetime.now() - timedelta(1)).timestamp())

result = api_client.google_maps_reviews(
    'ChIJrc9T9fpYwokRdvjYRHT8nI4', sort='newest', cutoff=yesterday_timestamp, reviewsLimit=100, language='en')

Disclaimer: I work for Outscraper.

【Comments】: