【Question title】: Why are multiple prices being saved per product?
【Posted】: 2016-01-03 01:42:01
【Question】:

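I've been trying to figure out why this code produces multiple prices per product when the data is saved to the CSV. It appears that all the prices in a row on the page are saved under each product in that row. What I want is to save just one price per product, not 3 or 4.

I can't work this out on my own. What needs to change so that only the correct price is stored for each product?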

import mechanize
from lxml import html
import csv
import io
from time import sleep

def save_products (products, writer):

    for product in products:

        writer.writerow([ product["title"][0].encode('utf-8') ])
        for price in product['prices']:
            writer.writerow([ price["value"][0].encode('utf-8') ])

f_out = open('ssdResult.csv', 'wb')
writer = csv.writer(f_out)

links = ["http://sciencesuppliesdirect.com/research-chemicals", "http://sciencesuppliesdirect.com/research-chemicals?p=2", "http://sciencesuppliesdirect.com/research-chemicals?p=3","http://sciencesuppliesdirect.com/research-chemicals?p=4","http://sciencesuppliesdirect.com/research-chemicals?p=5","http://sciencesuppliesdirect.com/research-chemicals?p=6","http://sciencesuppliesdirect.com/research-chemicals?p=7","http://sciencesuppliesdirect.com/research-chemicals?p=8","http://sciencesuppliesdirect.com/research-chemicals?p=9","http://sciencesuppliesdirect.com/research-chemicals?p=10","http://sciencesuppliesdirect.com/research-chemicals?p=11","http://sciencesuppliesdirect.com/research-chemicals?p=12","http://sciencesuppliesdirect.com/research-chemicals?p=13","http://sciencesuppliesdirect.com/research-chemicals?p=14","http://sciencesuppliesdirect.com/research-chemicals?p=15","http://sciencesuppliesdirect.com/research-chemicals?p=16","http://sciencesuppliesdirect.com/research-chemicals?p=17","http://sciencesuppliesdirect.com/research-chemicals?p=18","http://sciencesuppliesdirect.com/research-chemicals?p=19","http://sciencesuppliesdirect.com/research-chemicals?p=20","http://sciencesuppliesdirect.com/research-chemicals?p=21","http://sciencesuppliesdirect.com/research-chemicals?p=22","http://sciencesuppliesdirect.com/research-chemicals?p=23","http://sciencesuppliesdirect.com/research-chemicals?p=24","http://sciencesuppliesdirect.com/cannabinoids","http://sciencesuppliesdirect.com/cannabinoids?p=2","http://sciencesuppliesdirect.com/cannabinoids?p=3","http://sciencesuppliesdirect.com/cannabinoids?p=4","http://sciencesuppliesdirect.com/cannabinoids?p=5","http://sciencesuppliesdirect.com/cannabinoids?p=6","http://sciencesuppliesdirect.com/cannabinoids?p=7","http://sciencesuppliesdirect.com/pellets","http://sciencesuppliesdirect.com/pellets?p=2","http://sciencesuppliesdirect.com/pellets?p=3","http://sciencesuppliesdirect.com/herbal-blends","http://sciencesuppliesdirect.com/herbal-blends?p=2","http://sciencesuppliesdirect.com/branded-products","http://sciencesuppliesdirect.com/branded-products?p=2"]

br = mechanize.Browser() 

for link in links:

    print(link)
    r = br.open(link)

    content = r.read()

    products = []        
    tree = html.fromstring(content)        
    product_nodes = tree.xpath('//div[@class="category-products"]/ul')

    for product_node in product_nodes:

        product = {}
        try:
            product['title'] = product_node.xpath('.//li/div[2]/h2/a/text()')

        except:
            product['title'] = ""

        price_nodes = product_node.xpath('.//li/div[2]/div[1]/span')

        product['prices'] = []
        for price_node in price_nodes:

            price = {}
            try:
                price['value'] = price_node.xpath('.//span/text()')

            except:
                price['value'] = ""


            product['prices'].append(price)
        products.append(product)
    save_products(products, writer)

f_out.close() 

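【Comments】:

  • What do you mean? Please include the input (HTML?), the output you get, and what you expected.
  • The input is the links in the code. If you run it, you'll see that the results in the CSV have multiple prices for every item, whereas on the page each item has only one price.
  • It looks like you are storing multiple prices.
  • Yes, exactly. I'm trying to work out why I'm storing multiple prices rather than just the corresponding price for each item.
  • Does nobody have any suggestions for this problem?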

Tags: python csv xpath web-scraping mechanize


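【Solution 1】:

Take a close look at the data structure you are creating. Frankly, it's a mess.
At a quick glance, as far as I can tell, it looks like this: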

[
{
'prices': [{'value': [u'\xa35.00']}, {'value': [u'\xa35.00']}, {'value': [u'\xa36.00']}
],
'title': ['500mg Nitracaine', '5 x 4mg Flubromazepam Pellets', '1 Bk-2C-B Pellet', '10 0.5mg Pyrazolam Pellets']
}
]

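A list containing a single dictionary of prices and titles, where the prices are stored as a list of dictionaries that each hold a list, and the titles are a plain list.
I think!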
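To see what this structure does to the CSV, here is a minimal sketch that feeds the structure above through a Python 3 port of `save_products`. The port is an assumption made for illustration: it drops the Python 2 `.encode('utf-8')` calls and writes to an in-memory buffer instead of `ssdResult.csv`. Only the first title of the row is written, followed by every price found in that row.

```python
import csv
import io

# The structure shown above: one "product" that actually holds a whole row.
products = [
    {
        "prices": [
            {"value": ["\xa35.00"]},
            {"value": ["\xa35.00"]},
            {"value": ["\xa36.00"]},
        ],
        "title": [
            "500mg Nitracaine",
            "5 x 4mg Flubromazepam Pellets",
            "1 Bk-2C-B Pellet",
            "10 0.5mg Pyrazolam Pellets",
        ],
    }
]


def save_products(products, writer):
    # Same logic as the question's function, minus the Python 2 encode step.
    for product in products:
        writer.writerow([product["title"][0]])
        for price in product["prices"]:
            writer.writerow([price["value"][0]])


buffer = io.StringIO()
save_products(products, csv.writer(buffer))

# One title row, then one row per price in the list.
rows = buffer.getvalue().splitlines()
print(rows)
```

The other three titles never reach the file, which matches the symptom of one product appearing to own several prices.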
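It makes my head hurt just looking at it. The upshot is that your CSV-writing routine has no hope, given the way this data is structured. You will have to tidy it up to have any chance of producing what you want.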
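One possible way to tidy it up is to build one flat `[title, price]` record per product by iterating over each `li` instead of over the whole `ul`. The sketch below is not the site's real markup: `SAMPLE` is a hypothetical, well-formed stand-in that only mirrors the nesting implied by the question's XPath expressions, `extract_rows` is an illustrative name, and the standard library's `xml.etree.ElementTree` is swapped in for lxml so the example runs without extra dependencies.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for one row of the category page (not the real HTML).
SAMPLE = """
<div class="category-products">
  <ul>
    <li>
      <div>image</div>
      <div><h2><a>Product A</a></h2><div><span><span>5.00</span></span></div></div>
    </li>
    <li>
      <div>image</div>
      <div><h2><a>Product B</a></h2><div><span><span>6.00</span></span></div></div>
    </li>
  </ul>
</div>
"""


def extract_rows(markup):
    root = ET.fromstring(markup.strip())
    rows = []
    # Iterate over each <li> (one product), not over the <ul> (a whole row).
    for item in root.findall("./ul/li"):
        # Paths are relative to the current <li>, so each yields one value.
        title = item.findtext("div[2]/h2/a", default="").strip()
        price = item.findtext("div[2]/div[1]/span/span", default="").strip()
        rows.append([title, price])
    return rows


rows = extract_rows(SAMPLE)

# Each product now becomes exactly one CSV row.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

With lxml, the analogous change would presumably be to select `//div[@class="category-products"]/ul/li` and use paths relative to each `li`, but that has not been checked against the live pages.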
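Another thing: even if you change the code to store everything in a usable structure, it still does not allow for old-price and special-price entries, because product_node.xpath('.//li/div[2]/div[1]/span') is no longer the correct way to get at the price you want. Instead, it returns a subset that depends on where the first old-price is located, so the number of prices does not match the number of products.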
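A rough sketch of selecting the price by class rather than by position follows. Only the `old-price` and `special-price` class names come from the discussion above. Everything else (the `price-box`, `regular-price`, and `price` classes, the `h2` title, the `current_price` helper, and the rule "prefer the special price") is an assumption for illustration, and `xml.etree.ElementTree` again stands in for lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one normally priced product and one discounted product.
SAMPLE = """
<ul>
  <li>
    <h2>Product A</h2>
    <div class="price-box">
      <span class="regular-price"><span class="price">5.00</span></span>
    </div>
  </li>
  <li>
    <h2>Product B</h2>
    <div class="price-box">
      <p class="old-price"><span class="price">8.00</span></p>
      <p class="special-price"><span class="price">6.00</span></p>
    </div>
  </li>
</ul>
"""


def current_price(item):
    # Prefer the discounted price when the product has one.
    node = item.find(".//*[@class='special-price']/span[@class='price']")
    if node is None:
        # Otherwise fall back to the first price found for this product.
        node = item.find(".//span[@class='price']")
    if node is None:
        return ""
    return (node.text or "").strip()


root = ET.fromstring(SAMPLE.strip())
prices = {item.findtext("h2"): current_price(item) for item in root.findall("li")}
print(prices)
```

Because the lookup runs inside a single `li`, each product yields exactly one price regardless of how many price elements the row contains.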

【Comments】:
