【Question Title】: How to scrape the 'People also ask' box from Google search?
【Posted】: 2019-03-27 16:40:41
【Question Description】:

I need to scrape the "People also ask" box from Google to get the questions and answers.

I run a Google search and then scrape the result with BeautifulSoup.

import requests
from bs4 import BeautifulSoup

link = "https://www.google.com/search?client=firefox-b-d&source=hp&ei=v0mUXPu2ApTljwS6iLnABA&ei=lAyVXMPFCsaUsgXqmZT4DQ&q=what+is+java"

# Send a desktop browser User-Agent with the request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
page = requests.get(link, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
# For answers:
mydivs = soup.find_all('div', class_="ILfuVd NA6bn")

The result is an empty list. I checked the HTML file, and the answers really are under that class. What is wrong with my code?

【Comments】:

  • If you found a solution, could you update the question?

Tags: python-3.x web-scraping beautifulsoup


【Solution 1】:

The people-also-ask package may help you.

pip install people-also-ask

Usage example:

import people_also_ask

people_also_ask.get_related_questions("coffee", 5)

['How did coffee originate?',
 'Is coffee good for your health?',
 'Who brought coffee America?',
 'Who invented coffee?',
 'Why is coffee bad for you?',
 'Why is drinking coffee bad for you?']

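【Comments】:

【Solution 2】:

Google's front page updates as you type text into the search box, so a plain request to the search page will not return those results.

You can go to https://google.com in your browser, open the Developer Tools panel (usually F12), and look at the Network tab to see the underlying requests made to the autocomplete API.

The endpoint appears to be https://www.google.com/complete/search?q=yourQueryHere&client=psy-ab, which is much easier to query than the HTML page: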

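    import requests

    query = "what is java"
    # Pass the query via params so that requests URL-encodes it
    res = requests.get("https://google.com/complete/search", params={"client": "psy-ab", "q": query})
    print(res.text)  # print the response body rather than the Response object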
    

Also, Google probably does not want people scraping this content, so you may run into rate limiting if you make too many requests.
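The rate-limiting caveat above can be handled by retrying with increasing pauses. The following is a minimal sketch, not something taken from the answer: `build_autocomplete_url` and `fetch_with_backoff` are hypothetical helper names, and it assumes rate limiting is signalled with HTTP status 429 (the standard "Too Many Requests" code), which the answer does not confirm for this endpoint.

```python
import time
from urllib.parse import urlencode

# Endpoint quoted in the answer above
AUTOCOMPLETE_URL = "https://www.google.com/complete/search"


def build_autocomplete_url(query, client="psy-ab"):
    """Build the autocomplete URL with a properly URL-encoded query."""
    return AUTOCOMPLETE_URL + "?" + urlencode({"client": client, "q": query})


def fetch_with_backoff(url, get, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call get(url) until the response is not HTTP 429 or attempts run out.

    `get` is any callable returning an object with a `status_code`
    attribute, for example `requests.get`. The pause doubles after each
    rate-limited attempt (1s, 2s, 4s, ... with the default base_delay).
    Assumption: the server reports rate limiting with status 429.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = get(url)
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    # Still rate limited after the final attempt: return the last response
    return response
```

With requests installed, this could be called as `fetch_with_backoff(build_autocomplete_url("what is java"), requests.get)`. The `sleep` parameter exists only so that the waiting can be swapped out, for example in tests.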

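【Comments】:

  • Looking at the inspector, you can see that the results sit under class="LGOjhe" in the HTML file.
  • After you make the initial request, the HTML may be updated by scripts running on the page. There is probably a script that captures the input in the search bar, sends a request to the endpoint above, and updates the HTML document.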
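One way to test the explanation in the comments above is to check whether the class appears in the HTML that the server actually returned, as opposed to the DOM shown in the browser inspector after scripts have run. This is a rough sketch using only the standard library's `html.parser`; `ClassCounter` and `count_elements_with_classes` are hypothetical names, and the sample markup is made up for illustration.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains all wanted classes."""

    def __init__(self, tag, wanted_classes):
        super().__init__()
        self.tag = tag
        self.wanted = set(wanted_classes)
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # attrs is a list of (name, value) pairs; the value may be None
        class_value = dict(attrs).get("class") or ""
        # Match regardless of the order in which the classes are written
        if self.wanted <= set(class_value.split()):
            self.count += 1


def count_elements_with_classes(html, tag, wanted_classes):
    """Return how many `tag` elements in `html` carry every wanted class."""
    parser = ClassCounter(tag, wanted_classes)
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up sample standing in for page.text from the question
sample_html = '<div class="NA6bn ILfuVd">answer</div><div class="other">x</div>'
print(count_elements_with_classes(sample_html, "div", ["ILfuVd", "NA6bn"]))  # prints 1
```

A count of 0 on `page.text` from the question would suggest that the element is added later by a script rather than served in the initial HTML.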
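【Solution 3】:
1. To get the answers, you can use Selenium's click method or another library that can simulate clicks.
2. Extract the data directly from the JavaScript.
3. Use the Google Related Questions API from SerpApi. It is a paid API with a free trial. Check the playground.

Code and example: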

    from serpapi import GoogleSearch
    import os
    
    params = {
      "engine": "google",
      "q": "what is java",
      "api_key": os.getenv("API_KEY"),
    }
    
    search = GoogleSearch(params)
    results = search.get_dict()
    
    for q_and_a in results['related_questions']:
      print(f"Question: {q_and_a['question']}\nAnswer: {q_and_a['snippet']}\n")
    

Output:

    Question: What is Java and why do I need it?
    Answer: Java is a programming language and computing platform first released by Sun Microsystems in 1995. There are lots of applications and websites that will not work unless you have Java installed, and more are created every day. Java is fast, secure, and reliable.
    
    Question: What is Java used for?
    Answer: One of the most widely used programming languages, Java is used as the server-side language for most back-end development projects, including those involving big data and Android development. Java is also commonly used for desktop computing, other mobile computing, games, and numerical computing.Apr 12, 2019
    
    Question: What is Java in simple words?
    Answer: Java is a high-level programming language developed by Sun Microsystems. Instead, Java programs are interpreted by the Java Virtual Machine, or JVM, which runs on multiple platforms. ... This means all Java programs are multiplatform and can run on different platforms, including Macintosh, Windows, and Unix computers.Apr 19, 2012
    
    Question: What is Java and its types?
    Answer: The types of the Java programming language are divided into two categories: primitive types and reference types. The primitive types (§4.2) are the boolean type and the numeric types. The numeric types are the integral types byte , short , int , long , and char , and the floating-point types float and double .
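The loop above indexes `results['related_questions']` and `q_and_a['snippet']` directly, which raises `KeyError` if either key is absent. As a defensive sketch, assuming (this is not stated in the answer) that some responses may omit those keys, the formatting can be moved into a small function that uses `dict.get`. `format_related_questions` is a hypothetical helper, and the sample dictionary only mimics the shape used above.

```python
def format_related_questions(results):
    """Return one 'Question/Answer' string per related question.

    Assumes the dictionary shape used in the answer above: a
    'related_questions' list of dicts with 'question' and 'snippet' keys.
    Missing keys are tolerated instead of raising KeyError.
    """
    lines = []
    for item in results.get("related_questions", []):
        question = item.get("question", "(no question)")
        answer = item.get("snippet", "(no snippet)")
        lines.append(f"Question: {question}\nAnswer: {answer}\n")
    return lines


# Hand-written sample mimicking the response shape, not real API output
sample = {
    "related_questions": [
        {"question": "What is Java used for?",
         "snippet": "One of the most widely used programming languages."},
        {"question": "What is Java and its types?"},  # no snippet on purpose
    ]
}
for line in format_related_questions(sample):
    print(line)
```

Keeping the formatting separate from the API call also makes it possible to try the function on a saved response without spending API credits.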
    

Disclaimer: I work for SerpApi.

【Comments】:
