【Question Title】: Can't scrape Google search results with BeautifulSoup
【Posted】: 2020-10-09 11:25:42
【Question】:

I want to scrape Google search results, but whenever I try, the program returns an empty list.

from bs4 import BeautifulSoup
import requests

keyWord = input("Input Your KeyWord :")

url = f'https://www.google.com/search?q={keyWord}'
src = requests.get(url).text
soup = BeautifulSoup(src, 'lxml')

container = soup.findAll('div', class_='g')

print(container)

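【Comments】:

  • My guess is that, since Google does not offer an API for its search service, they want users to visit the page directly and see the ads. The results may not be returned in the page itself; some JS may have to run on the page to add them (which is why a simple page load doesn't work). There are solutions for scraping Google Search, but they rely on some form of browser emulation and JS execution. So I guess there is no simple answer to this question.

Tags: python-3.x web-scraping beautifulsoup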


【Answer 1】:

To get the correct results page from Google, specify the User-Agent HTTP header. For English-only results, add the hl=en parameter to the URL:

from bs4 import BeautifulSoup
import requests

keyWord = input("Input Your KeyWord :")

# hl=en asks Google for the English results page
url = f'https://www.google.com/search?hl=en&q={keyWord}'
# Send a browser-like User-Agent instead of the default python-requests one
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}

src = requests.get(url, headers=headers).text
soup = BeautifulSoup(src, 'lxml')

# find_all is the current name for findAll; the class name 'g' may change over time
containers = soup.find_all('div', class_='g')

for c in containers:
    print(c.get_text(strip=True, separator=' '))
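
A side note on the URL itself: keywords with spaces or characters such as & need encoding before they go into the query string. Below is a minimal sketch using the standard library's urlencode. The helper name build_search_url is made up for illustration and is not part of the answer.

```python
from urllib.parse import urlencode


def build_search_url(keyword, lang="en"):
    """Return a Google search URL with hl and q percent-encoded.

    build_search_url is a hypothetical helper, not part of the original answer.
    """
    # urlencode escapes spaces, '&', '=' and non-ASCII characters in the values.
    query = urlencode({"hl": lang, "q": keyword})
    return f"https://www.google.com/search?{query}"


print(build_search_url("ice cream"))
print(build_search_url("salt & pepper"))
```

The resulting string can be passed to requests.get together with the headers dict shown above.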

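【Comments】:

  • It still returns an empty list for me.

【Answer 2】:

To add to Andrej Kesely's answer: if you get empty results, you can always move one div up or down to test, and go from there.

Code (say you want to scrape the title, summary, and link):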

from bs4 import BeautifulSoup
import requests
import json

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=ice cream',
                    headers=headers).text

soup = BeautifulSoup(html, 'lxml')

summary = []

for container in soup.find_all('div', class_='tF2Cxc'):
  # find() returns None when an element is missing, so check before reading it
  heading = container.find('h3', class_='LC20lb DKV0Md')
  article_summary = container.find('span', class_='aCOpRe')
  link = container.find('a')

  summary.append({
      'Heading': heading.text if heading else None,
      'Article Summary': article_summary.text if article_summary else None,
      'Link': link.get('href') if link else None,
  })

print(json.dumps(summary, indent=2, ensure_ascii=False))

Partial output:

[
  {
    "Heading": "Ice cream - Wikipedia",
    "Article Summary": "Ice cream (derived from earlier iced cream or cream ice) is a sweetened frozen food typically eaten as a snack or dessert. It may be made from dairy milk or cream and is flavoured with a sweetener, either sugar or an alternative, and any spice, such as cocoa or vanilla.",
    "Link": "https://en.wikipedia.org/wiki/Ice_cream"
  },
  {
    "Heading": "Jeni's Splendid Ice Creams",
    "Article Summary": "Jeni's Splendid Ice Cream, built from the ground up with superlative ingredients. Order online, visit a scoop shop, or find the closest place to buy Jeni's near you.",
    "Link": "https://jenis.com/"
  }
]
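
If the selectors above come back empty, one way to act on the "move up or down a div" advice is to list which class names actually occur on div elements in the HTML you received, and pick a container from that list. This is only a rough sketch using the standard library; DivClassCounter and div_class_counts are hypothetical names, and the sample markup is made up rather than real Google output.

```python
from collections import Counter
from html.parser import HTMLParser


class DivClassCounter(HTMLParser):
    """Count the class names that appear on <div> start tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        for name, value in attrs:
            # A class attribute can hold several space-separated names.
            if name == "class" and value:
                self.counts.update(value.split())


def div_class_counts(html):
    """Return a Counter mapping class name -> number of divs carrying it."""
    parser = DivClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Made-up markup for illustration only.
sample = '<div class="outer"><div class="g x">a</div><div class="g">b</div></div>'
for name, count in div_class_counts(sample).most_common():
    print(name, count)
```

In practice you would pass the html string from requests.get(...).text instead of sample, then try the most frequent class names as candidates for the result container.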

Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches. Check out the Playground.

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "ice cream",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
  print(f"Title: {result['title']}\nSummary: {result['snippet']}\nLink: {result['link']}\n")

Partial output:

Title: Ice cream - Wikipedia
Summary: Ice cream (derived from earlier iced cream or cream ice) is a sweetened frozen food typically eaten as a snack or dessert. It may be made from dairy milk or cream and is flavoured with a sweetener, either sugar or an alternative, and any spice, such as cocoa or vanilla.
Link: https://en.wikipedia.org/wiki/Ice_cream

Title: 6 Ice Cream Shops to Try in Salem, Massachusetts ...
Summary: 6 Ice Cream Shops to Try in Salem, Massachusetts · Maria's Sweet Somethings, 26 Front Street · Kakawa Chocolate House, 173 Essex Street · Melt ...
Link: https://www.salem.org/icecream/

Title: Melt Ice Cream - Salem
Summary: Homemade ice cream made on-site in Salem, MA. Bold innovative flavors, exceptional customer service, local ingredients.
Link: https://meltsalem.com/

Disclaimer: I work for SerpApi.
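
The loop over results["organic_results"] above raises KeyError if that key, or a field such as snippet, is absent. Below is a small defensive sketch, assuming a response can come back without those fields. format_results is a hypothetical helper, and the sample dict only mimics the shape of the output shown above.

```python
def format_results(results):
    """Format organic results as text blocks, tolerating missing keys.

    format_results is a hypothetical helper, not part of the SerpApi client.
    """
    blocks = []
    for result in results.get("organic_results", []):
        title = result.get("title", "")
        snippet = result.get("snippet", "")
        link = result.get("link", "")
        blocks.append(f"Title: {title}\nSummary: {snippet}\nLink: {link}\n")
    return blocks


# Sample shaped like the output above; the second entry has no snippet.
sample_results = {
    "organic_results": [
        {
            "title": "Ice cream - Wikipedia",
            "snippet": "Ice cream is a sweetened frozen food.",
            "link": "https://en.wikipedia.org/wiki/Ice_cream",
        },
        {"title": "Melt Ice Cream - Salem", "link": "https://meltsalem.com/"},
    ]
}

for block in format_results(sample_results):
    print(block)
```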

【Comments】:
