Scraping products from a Shopify site - unexpected results

【Question title】: Scraping products from a Shopify site - unexpected results
【Posted】: 2021-01-07 14:49:40
【Question description】:

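I'm new to coding in general, but for my first project I'm trying to build a monitor that watches Shopify sites for product changes.

My approach was to take publicly shared code from online and work backwards from it to understand it. That gave me the code below, which sits inside a larger class and appears to fetch products.json by iterating through the pages.

But when I load https://www.hanon-shop.com/collections/all/products.json in the browser and then print my list of items from the code below, the first few products are different. How does that make sense?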

def scrape_site(self):
    """
    Scrapes the specified Shopify site and adds items to array
    :return: None
    """
    self.items = []
    s = rq.Session()
    page = 1
    while page > 0:
        try:
            html = s.get(self.url + '?page=' + str(page) + '&limit=250', headers=self.headers, proxies=self.proxy, verify=False, timeout=20)
            output = json.loads(html.text)['products']
            if output == []:
                page = 0
            else:
                for product in output:
                    product_item = [{'title': product['title'], 'image': product['images'][0]['src'], 'handle': product['handle'], 'variants':product['variants']}]
                    self.items.append(product_item)
                logging.info(msg='Successfully scraped site')
                page += 1
        except Exception as e:
            logging.error(e)
            page = 0
        time.sleep(0.5)
    s.close()

【Question discussion】:

Tags: python html web-scraping python-requests shopify

【Solution 1】:

Requests accepts a dictionary of query parameters, and the response object has a json method, so this can be written more concisely.

import logging
import time

import requests


def scrape_site(self):
    """Scrape every page of the shop's products.json into self.items."""
    self.items = []
    page = 1

    with requests.Session() as s:
        while True:
            # requests URL-encodes this dict into the query string
            params = {
                'page': page,
                'limit': 250
            }

            try:
                r = s.get(self.url, params=params, headers=self.headers, proxies=self.proxy, verify=False, timeout=20)
                r.raise_for_status()
                products = r.json()['products']
                # Stop on the first empty page. The parsed response itself is a
                # non-empty dict, so the 'products' list is what must be tested.
                if not products:
                    break
                for product in products:
                    product_item = {
                        'title': product['title'],
                        'image': product['images'][0]['src'],
                        'handle': product['handle'],
                        'variants': product['variants']
                    }
                    self.items.append(product_item)
                logging.info(f'Successfully scraped page {page}')
                page += 1
                time.sleep(1)

            except Exception as e:
                # Any request, JSON or key error ends the scrape
                logging.error(e)
                break

    return self.items
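
The stop condition is the easiest part of a loop like this to get wrong, so it can help to separate the pagination from the HTTP call and exercise it without touching the network. The sketch below is not part of the original answer. `collect_products`, `fetch_page`, `max_pages` and the fake data are hypothetical names for illustration. `fetch_page(page)` is assumed to return the parsed JSON for one page in the `{'products': [...]}` shape that the question's code reads.

```python
from typing import Callable, Dict, List


def collect_products(fetch_page: Callable[[int], Dict], max_pages: int = 1000) -> List[Dict]:
    """Call fetch_page(1), fetch_page(2), ... until a page has no products."""
    items: List[Dict] = []
    for page in range(1, max_pages + 1):
        products = fetch_page(page).get('products', [])
        if not products:
            # An empty (or missing) product list marks the end of the catalogue
            break
        items.extend(products)
    return items


# Stand-in for the network: two pages with data, then empty pages.
fake_pages = {
    1: {'products': [{'handle': 'a'}, {'handle': 'b'}]},
    2: {'products': [{'handle': 'c'}]},
}


def fake_fetch(page: int) -> Dict:
    return fake_pages.get(page, {'products': []})


handles = [p['handle'] for p in collect_products(fake_fetch)]
print(handles)  # ['a', 'b', 'c']
```

With Requests, `fetch_page` could be a small function that wraps `s.get(url, params={'page': page, 'limit': 250}, timeout=20).json()`. The `max_pages` cap is only a safety net against a loop that never sees an empty page.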

【Discussion】:
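
The solution tidies the request code but does not say why the first few products differ from what the browser shows. One way to investigate is to save the JSON the browser displays and compare product handles, in order, against the scraped list. This is a minimal sketch rather than a diagnosis: `first_mismatch` is a hypothetical helper, the sample data is made up, and both inputs are assumed to be lists of dicts with a `handle` key.

```python
from typing import Dict, List, Optional, Tuple


def first_mismatch(a: List[Dict], b: List[Dict]) -> Optional[Tuple[int, Optional[str], Optional[str]]]:
    """Return (index, handle_a, handle_b) for the first position where the
    two product lists disagree, or None if they match all the way through."""
    for i in range(max(len(a), len(b))):
        handle_a = a[i]['handle'] if i < len(a) else None
        handle_b = b[i]['handle'] if i < len(b) else None
        if handle_a != handle_b:
            return i, handle_a, handle_b
    return None


# Made-up data standing in for the browser JSON and the scraped items
browser = [{'handle': 'x'}, {'handle': 'y'}, {'handle': 'z'}]
scraped = [{'handle': 'x'}, {'handle': 'z'}]
print(first_mismatch(browser, scraped))  # (1, 'y', 'z')
```

Note that the scraper sends `page`, `limit=250`, `self.headers` and `self.proxy`, while the URL loaded in the browser in the question carries none of these. Requesting exactly the same URL both ways removes one variable from the comparison.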