【问题标题】:Changing pages on google search scraping更改谷歌搜索抓取的页面
【发布时间】:2022-08-17 05:00:55
【问题描述】:
from urllib import response
import requests
import urllib
import pandas as pd
from requests_html import HTML
from requests_html import HTMLSession

def get_source(url):
    \"\"\"Return the source code for the provided URL. 

    Args: 
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html. 
    \"\"\"

    try:
        session = HTMLSession()
        response = session.get(url)
        return response

    except requests.exceptions.RequestException as e:
        print(e)
        
def scrape_google(query):

    query = urllib.parse.quote_plus(query)
    response = get_source(\"https://www.google.com/search?q=\" + query)

    links = list(response.html.absolute_links)
    google_domains = (\'https://www.google.\', 
                      \'https://google.\', 
                      \'https://webcache.googleusercontent.\', 
                      \'http://webcache.googleusercontent.\', 
                      \'https://policies.google.\',
                      \'https://support.google.\',
                      \'https://maps.google.\')

    for url in links[:]:
        if url.startswith(google_domains):
            links.remove(url)

    return links

def get_results(query):
    
    query = urllib.parse.quote_plus(query)
    response = get_source(\"https://www.google.co.uk/search?q=\" + query)
    
    return response

def parse_results(response):
    
    css_identifier_result = \".tF2Cxc\"
    css_identifier_title = \"h3\"
    css_identifier_link = \".yuRUbf a\"
    css_identifier_text = \".VwiC3b\"
    
    results = response.html.find(css_identifier_result)

    output = []
    
    for result in results:

        item = {
            \'title\': result.find(css_identifier_title, first=True).text,
            \'link\': result.find(css_identifier_link, first=True).attrs[\'href\'],
            \'text\': result.find(css_identifier_text, first=True).text
        }
        
        output.append(item)
        
    return output

def google_search(query):
    response = get_results(query)
    return parse_results(response)

我想在我的代码中添加一个部分来更改页面,但我找不到方法!有人可以帮忙吗?

  • 不要刮谷歌,使用他们的 API
  • 是的,但我不想使用 google api
  • 我不认为我将其列为一个选项,使用 Google 的搜索引擎 API,它也会变得更容易,您不需要解析任何内容,只需从字典中获取值
  • 这回答了你的问题了吗? Searching in Google with Python 不过,请阅读关于该问题的第二条评论,您应该再次使用他们的 API
  • 我最近遇到了一个和你类似的问题。我附上我的答案的链接:stackoverflow.com/a/72938742/18597245

标签: python web-scraping


【解决方案1】:

要使用分页抓取 Google 搜索,其中一种方法是使用 start URL 参数,默认情况下等于 00 表示第一页,10 表示第二页,依此类推。

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "Elon Musk",
    "hl": "en",         # language
    "gl": "us",         # country of the search, US -> USA
    "start": 0,         # number page by default up to 0
    "filter": 0         # shows more than 10 pages. By default up to ~10-15 if filter = 1.
}

此外,默认搜索结果返回约 10-15 页。要增加返回页数,您需要将filter 参数设置为0 并将其传递给将返回10+ 页的URL。基本上,这个参数定义了Similar ResultsOmitted Results 的过滤器。

当下一个按钮存在时,您需要将 ["start"] 参数值增加 10 以访问它存在的下一页 if,否则我们需要将 break 退出 while 循环:

if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break

此外,请确保您使用 request headers user-agent 充当“真实”用户访问。因为默认 requests user-agentpython-requests 并且网站知道它很可能是发送请求的脚本。 Check what's your user-agent

代码和full example in online IDE

from bs4 import BeautifulSoup
import requests, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "Elon Musk",
    "hl": "en",         # language
    "gl": "us",         # country of the search, US -> USA
    "start": 0,         # number page by default up to 0
    "filter": 0         # show all pages by default up to 10
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

while True:
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')
    
    for result in soup.select(".tF2Cxc"):
        title = f'Title: {result.select_one("h3").text}'
        link = f'Link: {result.select_one("a")["href"]}'

        print(title, link, sep="\n", end="\n\n")

    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break

输出:

Title: Elon Musk - Wikipedia
Link: https://en.wikipedia.org/wiki/Elon_Musk
Description: Elon Reeve Musk FRS is a business magnate and investor. He is the founder, CEO, and Chief Engineer at SpaceX; angel investor, CEO, and Product Architect of ...

Title: Elon Musk - Forbes
Link: https://www.forbes.com/profile/elon-musk/
Description: Elon Musk is working to revolutionize transportation both on Earth, through electric car maker Tesla -- and in space, via rocket producer SpaceX.

Title: Elon Musk - Tesla
Link: https://www.tesla.com/elon-musk
Description: As the co-founder and CEO of Tesla, Elon leads all product design, engineering and global manufacturing of the company's electric vehicles, battery products and ...

... other results from the 1st and subsequent pages.

Title: Elon Musk: Revolutionary private space entrepreneur
Link: https://www.space.com/18849-elon-musk.html
Description: Oct 30, 2021 — As the CEO of SpaceX and Tesla, Elon Musk is transforming transport in space and on Earth.

Title: Elon Musk: When I saw the Tesla CEO for who he really is.
Link: https://slate.com/technology/2022/05/elon-musk-tesla-twitter-fables.html
Description: May 27, 2022 — Overtaken by curiosity, I had decided to spend a long Memorial Day weekend in California's Central Valley to see if Elon Musk's latest bit ...

Title: Elon Musk on 'Giant Money Furnaces' and Not Letting Tesla ...
Link: https://www.barrons.com/articles/elon-musk-tesla-bankruptcy-twitter-51655982269
Description: Jun 23, 2022 — Tesla CEO Elon Musk spoke with a Tesla owners club. He talked bankruptcy, advertising, full self driving, and birthrates.

... other results from 10th page.

或者您可以使用 SerpApi pagination for Google Search results 这是一个 API。下面,我演示了一个关于对所有页面进行分页和提取链接的短代码 sn-p:

from serpapi import GoogleSearch
import os

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    "api_key": os.getenv("API_KEY"),    # your serpapi api key
    "engine": "google",                 # search engine
    "q": "Elon Musk",                   # search query
    "location": "Dallas",               # your location
    # other parameters
}

search = GoogleSearch(params)           # where data extraction happens on the SerpApi backend
pages = search.pagination()             # JSON -> Python dict

for page in pages:
    for result in page["organic_results"]:
        print(f"Title: {result['title']}\nLink: {result['link']}\n\n")

输出将是相同的。

免责声明,我为 SerpApi 工作。

【讨论】:

    猜你喜欢
    • 2018-12-19
    • 1970-01-01
    • 2020-05-03
    • 2021-07-20
    • 1970-01-01
    • 1970-01-01
    • 2021-01-17
    • 2023-03-29
    • 2022-09-09
    相关资源
    最近更新 更多