【Question title】: How to scrape related searches on Google?
【Posted】: 2022-11-23 15:57:48
【Question description】:

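Given a list of keywords, I'm trying to look up the related searches for each one on Google and then write those related searches to a CSV file. My problem is getting Beautiful Soup to identify the HTML tags that hold the related searches.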

Here is an example HTML tag from the page source:

<div data-ved="2ahUKEwitr8CPkLT3AhVRVsAKHVF-C80QmoICKAV6BAgEEBE">iphone xr</div>

Here is my webdriver setup:

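import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By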

user_agent = 'Chrome/100.0.4896.60'

webdriver_options = webdriver.ChromeOptions()
webdriver_options.add_argument('user-agent={0}'.format(user_agent))


capabilities = webdriver_options.to_capabilities()
capabilities["acceptSslCerts"] = True
capabilities["acceptInsecureCerts"] = True

Here is my code:

queries = ["iphone"]

driver = webdriver.Chrome(options=webdriver_options, desired_capabilities=capabilities, port=4444)

df2 = []

driver.get("https://google.com")
time.sleep(3)
driver.find_element(By.CSS_SELECTOR, "[aria-label='Agree to the use of cookies and other data for the purposes described']").click()

# get_current_related_searches
for query in queries:
    driver.get("https://google.com/search?q=" + query)
    time.sleep(3)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    p = soup.find_all('div data-ved')
    print(p)
    d = pd.DataFrame({'loop': 1, 'source': query, 'from': query, 'to': [s.text for s in p]})
    terms = d["to"]
    df2.append(d)
    time.sleep(3)

df = pd.concat(df2).reset_index(drop=False)

df.to_csv("related_searches.csv")

It's the p = soup.find_all line that is incorrect; I'm just not sure how to get BS to identify these specific HTML tags. Any help would be great :)

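【Comments】:

  • Google does not allow scraping, and its HTML is highly dynamic (generated class names, etc.), which doesn't help. I'd discourage trying to scrape Google; look for an API alternative instead.
  • OK, thanks for the heads-up. Any suggestions for a good API?
  • Use Google's API.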

Tags: python selenium google-chrome web-scraping beautifulsoup


【Solution 1】:

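@jakecohensol, as you noted, the selector in p = soup.find_all is wrong: find_all treats 'div data-ved' as a tag name, so it matches nothing. The CSS selector that worked at the time of writing is .y6Uyqe .AB4Wff, used with soup.select.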
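To make the difference concrete, here is a minimal sketch run against a small static HTML string rather than a live Google page. The markup is a hypothetical simplification that reuses the class names and the data-ved attribute mentioned in this thread; real Google HTML is much larger and its generated class names can change at any time.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; not a real Google response.
html = """
<div class="y6Uyqe">
  <div class="AB4Wff" data-ved="abc">iphone xr</div>
  <div class="AB4Wff" data-ved="def">iphone 13</div>
</div>
<div class="other">not a related search</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() treats its first argument as a tag name, so this looks for a
# tag literally named "div data-ved" and matches nothing.
wrong = soup.find_all("div data-ved")

# Match <div> tags that carry a data-ved attribute, whatever its value.
by_attribute = soup.find_all("div", attrs={"data-ved": True})

# Or use a CSS selector: elements with class AB4Wff inside an element with class y6Uyqe.
by_css = soup.select(".y6Uyqe .AB4Wff")

related = [tag.get_text(strip=True) for tag in by_css]
print(len(wrong), len(by_attribute), related)
```

On a real results page many unrelated elements also carry data-ved, which is presumably why the answer selects by class instead of by that attribute.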

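The Chrome/100.0.4896.60 User-Agent header is not a valid browser string, and Google blocks requests that carry agent strings like it. With a full User-Agent string, Google returns the correct HTML response.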
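As a rough illustration of what "a full User-Agent string" means in practice, the sketch below builds a request with requests but never sends it, so the outgoing URL and headers can be inspected offline. It also prints the default agent that requests would announce if no header were supplied. The variable names are only for illustration.

```python
import requests

# The full browser-style string used in the answer's code.
full_user_agent = (
    "Mozilla/5.0 (X11; CrOS x86_64 14526.89.0) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/100.0.4896.133 Safari/537.36"
)

# What requests announces when no User-Agent is supplied, e.g. "python-requests/2.x".
default_user_agent = requests.utils.default_user_agent()

# Build and prepare the request without sending it, so it can be inspected.
request = requests.Request(
    "GET",
    "https://google.com/search",
    params={"q": "iphone"},
    headers={"User-Agent": full_user_agent},
)
prepared = request.prepare()

print(default_user_agent)
print(prepared.url)
print(prepared.headers["User-Agent"])
```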

You can scrape Google related searches without a browser, which is faster and more reliable.

Here is your fixed code snippet (link to the full code in online IDE):

import time
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 14526.89.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.133 Safari/537.36"
}

queries = ["iphone", "pixel", "samsung"]

df2 = []

# get_current_related_searches
for query in queries:
    params = {"q": query}
    response = requests.get("https://google.com/search", params=params, headers=headers)

    soup = BeautifulSoup(response.text, "html.parser")

    p = soup.select(".y6Uyqe .AB4Wff")

    d = pd.DataFrame(
        {"loop": 1, "source": query, "from": query, "to": [s.text for s in p]}
    )

    terms = d["to"]
    df2.append(d)

    time.sleep(3)

df = pd.concat(df2).reset_index(drop=False)

df.to_csv("related_searches.csv")

Sample output:

,index,loop,source,from,to
0,0,1,iphone,iphone,iphone 13
1,1,1,iphone,iphone,iphone 12
2,2,1,iphone,iphone,iphone x
3,3,1,iphone,iphone,iphone 8
4,4,1,iphone,iphone,iphone 7
5,5,1,iphone,iphone,iphone xr
6,6,1,iphone,iphone,find my iphone
7,0,1,pixel,pixel,pixel 6
8,1,1,pixel,pixel,google pixel
9,2,1,pixel,pixel,pixel phone
10,3,1,pixel,pixel,pixel 6 pro
11,4,1,pixel,pixel,pixel 3
12,5,1,pixel,pixel,google pixel price
13,6,1,pixel,pixel,pixel 6 release date
14,0,1,samsung,samsung,samsung galaxy
15,1,1,samsung,samsung,samsung tv
16,2,1,samsung,samsung,samsung tablet
17,3,1,samsung,samsung,samsung account
18,4,1,samsung,samsung,samsung mobile
19,5,1,samsung,samsung,samsung store
20,6,1,samsung,samsung,samsung a21s
21,7,1,samsung,samsung,samsung login

【Comments】:

    【Solution 2】:

    Have a look at the SelectorGadget Chrome extension, which gives you a CSS selector when you click the desired element in your browser.

    Check out what's your user agent, or find multiple user agents for mobile, tablet, PC, or different OS in order to rotate user agents, which slightly reduces the chance of being blocked.

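    The ideal scenario is to combine rotating user agents with rotating proxies (ideally residential), plus a CAPTCHA solver for the Google CAPTCHA that will eventually appear.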
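Rotating user agents can be sketched with the standard library alone. The pool below is a hypothetical example (the first entry is the string from Solution 1, the other two are illustrative), and nothing is sent over the network; the sketch only shows how each query gets paired with the next agent in the pool.

```python
import itertools

# Hypothetical pool; replace with current, real browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (X11; CrOS x86_64 14526.89.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.133 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
]


def rotating_headers(user_agents):
    """Yield header dicts forever, cycling through the given user agents in order."""
    for user_agent in itertools.cycle(user_agents):
        yield {"User-Agent": user_agent}


headers_iter = rotating_headers(USER_AGENTS)
queries = ["banana", "minecraft", "apple stock", "how to create a apple pie"]

# Pair each query with the next header set; the fourth query wraps around to the first agent.
plan = [(query, next(headers_iter)) for query in queries]
for query, headers in plan:
    print(query, "->", headers["User-Agent"][:40])
```

Each headers dict could then be passed to requests.get(..., headers=headers). If proxies are rotated as well, requests.get also accepts a proxies mapping, which could be cycled the same way.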

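    Alternatively, if you don't want to figure out how to build and maintain a parser from scratch, or how to get around blocks from Google (or other search engines), you can use the Google Search Engine Results API to scrape Google search results.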

    Example code to integrate:

    import os
    import pandas as pd                       # needed for the DataFrame/CSV export below
    from serpapi import GoogleSearch
    
    queries = [
        'banana',
        'minecraft',
        'apple stock',
        'how to create a apple pie'
    ]
    
    def serpapi_scrape_related_queries():
    
        related_searches = []
    
        for query in queries:
            print(f'extracting related queries from query: {query}')
    
            params = {
                'api_key': os.getenv('API_KEY'),  # your serpapi api key
                'device': 'desktop',              # device to retrieve results from
                'engine': 'google',               # serpapi parsing engine
                'q': query,                       # search query
                'gl': 'us',                       # country of the search
                'hl': 'en'                        # language of the search
            }
    
            search = GoogleSearch(params)         # where data extracts on the backend
            results = search.get_dict()           # JSON -> dict
    
            for result in results['related_searches']:
                query = result['query']
                link = result['link']
    
                related_searches.append({
                    'query': query,
                    'link': link
                })
    
        pd.DataFrame(data=related_searches).to_csv('serpapi_related_queries.csv', index=False)
    
    serpapi_scrape_related_queries()
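The loop above indexes results['related_searches'] directly, which would raise a KeyError for any response that lacks that key. A more defensive variant, sketched here on hand-written dicts, might look as follows. The helper name extract_related and the sample data are hypothetical; the dicts only mimic the 'query' and 'link' fields that the code above reads and are not real API responses.

```python
def extract_related(results, source_query):
    """Return one row per related search, or an empty list if the key is absent."""
    rows = []
    for item in results.get("related_searches", []):
        rows.append({
            "source": source_query,      # the query that produced this suggestion
            "query": item.get("query"),
            "link": item.get("link"),
        })
    return rows


# Hypothetical response dicts; the link value is made up for illustration.
with_related = {
    "related_searches": [
        {"query": "banana benefits", "link": "https://www.google.com/search?q=banana+benefits"},
    ]
}
without_related = {}

rows = extract_related(with_related, "banana")
empty = extract_related(without_related, "minecraft")
print(rows)
print(empty)
```

Keeping the source query in each row also makes it possible to tell which keyword a suggestion came from once all rows are written to one CSV file.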
    

    Partial DataFrame output:

                 query                                               link
    0  banana benefits  https://www.google.com/search?gl=us&hl=en&q=Ba...
    1  banana republic  https://www.google.com/search?gl=us&hl=en&q=Ba...
    2      banana tree  https://www.google.com/search?gl=us&hl=en&q=Ba...
    3   banana meaning  https://www.google.com/search?gl=us&hl=en&q=Ba...
    4     banana plant  https://www.google.com/search?gl=us&hl=en&q=Ba...
    

    【Comments】:
