【Question title】: Scraping products from a Shopify site - unexpected results
【Posted】: 2021-01-07 14:49:40
【Question】:

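I'm new to coding in general, but for my first project I'm trying to build a monitor that watches a Shopify site for product changes.

My approach was to take publicly shared code and work backwards from it to understand it. The code below comes from a larger class, and it appears to fetch products.json by iterating over pages.

But when I load https://www.hanon-shop.com/collections/all/products.json and then print my list of items, the first few products are different. Why would that be?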

def scrape_site(self):
        """
        Scrapes the specified Shopify site and adds items to array
        :return: None
        """
        self.items = []
        s = rq.Session()
        page = 1
        while page > 0:
            try:
                html = s.get(self.url + '?page=' + str(page) + '&limit=250', headers=self.headers, proxies=self.proxy, verify=False, timeout=20)
                output = json.loads(html.text)['products']
                if output == []:
                    page = 0
                else:
                    for product in output:
                        product_item = [{'title': product['title'], 'image': product['images'][0]['src'], 'handle': product['handle'], 'variants':product['variants']}]
                        self.items.append(product_item)
                    logging.info(msg='Successfully scraped site')
                    page += 1
            except Exception as e:
                logging.error(e)
                page = 0
            time.sleep(0.5)
        s.close()

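【Comments】:

    Tags: python html web-scraping python-requests shopify


    【Answer 1】:

    Requests accepts a dict of query parameters and also provides a json() method on the response, so this can be written more concisely.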

    import logging
    import time

    import requests


    def scrape_site(self):
        self.items = []
        page = 1

        with requests.Session() as s:
            while True:
                params = {
                    'page': page,
                    'limit': 250
                }

                try:
                    r = s.get(self.url, params=params, headers=self.headers, proxies=self.proxy, verify=False, timeout=20)
                    r.raise_for_status()
                    # r.json() is a dict like {'products': [...]}, which is truthy even
                    # when the list is empty, so check the list itself to stop paging.
                    products = r.json()['products']
                    if not products:
                        break
                    for product in products:
                        images = product['images']
                        product_item = {
                            'title': product['title'],
                            # Some products have no images; avoid an IndexError.
                            'image': images[0]['src'] if images else None,
                            'handle': product['handle'],
                            'variants': product['variants']
                        }
                        self.items.append(product_item)
                    logging.info(f'Successfully scraped page {page}')
                    page += 1
                    time.sleep(1)

                except Exception as e:
                    logging.error(e)
                    break

        return self.items
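
The paging logic above can also be tried without touching the network. The following is a minimal sketch, not part of the original answer: `extract_item`, `iter_items`, and `FAKE_PAGES` are hypothetical names, and the fake data only imitates the general shape of a products.json response (a list of products with `title`, `handle`, `images`, and `variants`).

```python
from typing import Callable, Dict, Iterator, List


def extract_item(product: Dict) -> Dict:
    """Pick the fields the monitor cares about; tolerate products with no images."""
    images = product.get('images') or []
    return {
        'title': product['title'],
        'image': images[0]['src'] if images else None,
        'handle': product['handle'],
        'variants': product.get('variants', []),
    }


def iter_items(fetch_page: Callable[[int], List[Dict]]) -> Iterator[Dict]:
    """Yield items page by page until fetch_page returns an empty list."""
    page = 1
    while True:
        products = fetch_page(page)
        if not products:
            break
        for product in products:
            yield extract_item(product)
        page += 1


# Hypothetical in-memory stand-in for the paginated endpoint (assumed data).
FAKE_PAGES = {
    1: [{'title': 'Shoe A', 'handle': 'shoe-a',
         'images': [{'src': 'a.jpg'}], 'variants': [{'id': 1}]}],
    2: [{'title': 'Shoe B', 'handle': 'shoe-b',
         'images': [], 'variants': []}],
}

items = list(iter_items(lambda page: FAKE_PAGES.get(page, [])))
```

With a real session `s`, the `fetch_page` argument could be something like `lambda page: s.get(url, params={'page': page, 'limit': 250}, timeout=20).json()['products']`. Separating fetching from paging makes the stop condition (an empty product list) easy to check in isolation.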
    

    【Comments】:
