How can I scrape multiple Amazon search result pages using Python? Answers

【Question title】: How can I scrape multiple search result pages of Amazon using Python?
【Posted】: 2020-10-10 13:48:36
【Question description】:

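How can I scrape details from multiple Amazon search result pages? Page 1 works fine, but the other pages do not work correctly and the results are different.

Contents of the YML file: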

products:
    css: 'div[data-component-type="s-search-result"]'
    xpath: null
    multiple: true
    type: Text
    children:
        title:
            css: 'h2 a.a-link-normal.a-text-normal'
            xpath: null
            type: Text
        url:
            css: 'h2 a.a-link-normal.a-text-normal'
            xpath: null
            type: Link
        rating:
            css: 'div.a-row.a-size-small span:nth-of-type(1)'
            xpath: null
            type: Attribute
            attribute: aria-label
        reviews:
            css: 'div.a-row.a-size-small span:nth-of-type(2)'
            xpath: null
            type: Attribute
            attribute: aria-label
        price:
            css: 'span.a-price:nth-of-type(1) span.a-offscreen'
            xpath: null
            type: Text

This is the function I am using:

from selectorlib import Extractor
import requests 
import json 
from time import sleep
e = Extractor.from_yaml_file('search_result.yml')

def scrape(url):  

    headers = {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.in/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }

    # Download the page using requests
    print("Downloading %s"%url)
    r = requests.get(url, headers=headers)
    # Simple check to check if page was blocked (Usually 503)
    if r.status_code > 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n"%url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d"%(url,r.status_code))
        return None
    # Pass the HTML of the page to the extractor and return the extracted data
    return e.extract(r.text)
data = scrape('https://www.amazon.in/s?k=mobile')
print(data)

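It works fine for the first page, but when I click through to the next page the URL changes dynamically and includes a qid parameter.

Example of the second-page link: 'https://www.amazon.in/s?k=mobile&page=2&qid=1602337497&ref=sr_pg_2'

When I run a loop, I build the URLs like this: 'https://www.amazon.in/s?k=mobile&page={}'.format(i).

That also returns results, but they differ from the ones I get when I click the link.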
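
To see exactly how the hand-built URL differs from the link the site generates, the two query strings can be compared with the standard library. This is only a minimal sketch: `page_url` and `query_params` are illustrative helper names that do not appear in the original code, and the sketch makes no network request.

```python
from urllib.parse import urlencode, urlsplit, parse_qsl

BASE = 'https://www.amazon.in/s'

def page_url(keyword, page):
    """Build a search URL the same way as the loop in the question (k and page only)."""
    return BASE + '?' + urlencode({'k': keyword, 'page': page})

def query_params(url):
    """Return the query string of a URL as a dict."""
    return dict(parse_qsl(urlsplit(url).query))

# Link copied from the question (obtained by clicking "next page")
clicked = 'https://www.amazon.in/s?k=mobile&page=2&qid=1602337497&ref=sr_pg_2'
built = page_url('mobile', 2)

# Parameters present in the clicked link but missing from the hand-built one
built_params = query_params(built)
missing = {k: v for k, v in query_params(clicked).items() if k not in built_params}

print(built)
print(missing)
```

The comparison shows that only `qid` and `ref` are absent from the hand-built URL; whether those parameters alone explain the different results is not something this sketch can confirm.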

How can I scrape multiple pages of Amazon search results?

【Question comments】:

Tags: python web-scraping beautifulsoup amazon

【Solution 1】:

I was able to find this, and it works well; just loop over the page number:

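import json

import requests

# Wrap the code below in a loop over page_number to fetch several pages
page_number = 1
my_url = 'https://www.amazon.in/s/query?k=mobile&page={}&qid=1604103880&ref=sr_pg_{}'.format(page_number, page_number)

res = requests.post(my_url, data={"customer-action": "pagination"}, headers={'User-Agent': 'Mozilla/5.0'})
# The response body is a series of arrays separated by "&&&"
rows = res.text.split("&&&")
for row in rows:
    html_content = ''
    try:
        # Parse with json.loads rather than eval: never execute text received from a server
        array = json.loads(row)
        json_data = array[2]
        index = json_data["index"]
        html_content = json_data["html"]
    except (ValueError, IndexError, KeyError, TypeError):
        # Rows that are not valid JSON or that carry no product HTML are skipped
        pass
    # Perform your research, note that some rows don't concern the products
    print(html_content)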
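
The parsing step above can be isolated into a function and exercised offline, which makes it easier to check before sending any requests. The following is a sketch under stated assumptions: `extract_html_rows` is a hypothetical helper name, and the sample body is synthetic, modeled only on the fields the answer reads (`index` and `html` in the third array element); it is not real Amazon output. Each chunk is parsed with `json.loads`, which avoids executing response text as code.

```python
import json

def extract_html_rows(body):
    """Split a '&&&'-separated response body and return (index, html) pairs."""
    results = []
    for chunk in body.split('&&&'):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip empty pieces, e.g. after a trailing separator
        try:
            payload = json.loads(chunk)[2]
            results.append((payload['index'], payload['html']))
        except (ValueError, IndexError, KeyError, TypeError):
            # Not valid JSON, or not a row that carries product HTML
            continue
    return results

# Synthetic sample for illustration only (assumed shape, not captured from Amazon)
sample = (
    '["dispatch", "meta", {"total": 2}]\n&&&\n'
    '["dispatch", "result-0", {"index": 0, "html": "<div>Phone A</div>"}]\n&&&\n'
    '["dispatch", "result-1", {"index": 1, "html": "<div>Phone B</div>"}]\n&&&\n'
)

rows = extract_html_rows(sample)
print(rows)
```

With a real response, `res.text` would be passed in place of `sample`, and each returned HTML fragment could then be handed to whatever parser is used for the product details.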

【Comments】: