【Question title】: How to parse the gaps?
【Posted】: 2014-03-02 08:55:33
【Question】:

Please help me get the price from eBay pages.

In the following script I fetch the prices from two specific pages.

import pprint
import requests
import lxml.etree
import lxml.html
import lxml.cssselect
import re


def get_doc(url):
    try:
        req = requests.get(url)
    except Exception as exc:
        # Print the actual error instance, not the Exception class.
        print('Error open. __', exc)
    else:
        html = req.text
        doc = lxml.html.document_fromstring(html)
        return doc


for url in ['http://www.ebay.com/itm/DW-PDP-Concept-Pearlescent-White-Maple-Drumset-/121271668104?pt=US_Drums&hash=item1c3c5acd88', 'http://www.ebay.com/itm/LOT-OF-20-DRUM-SET-TUNING-KEYS-DW-TAMA-PEARL-SABIAN-and-OTHER-UNIQUE-KEYS-/291092068092?pt=US_Drums&hash=item43c67076fc']:
    doc = get_doc(url)
    title = doc.xpath('//h1[@id="itemTitle"]/text()')
    priceUSD = doc.xpath('//span[@itemprop="price"]/text()')
    print(title, priceUSD)

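The problem is that the price on the first page contains a non-breaking space ('&nbsp;'), so the xpath text() value comes back wrong. The output looks like this:

['DW/PDP Concept Pearlescent White Maple Drumset'] ['US $1\xa0200,00']
['LOT OF 20 DRUM SET TUNING KEYS! DW! TAMA! PEARL! SABIAN! and OTHER UNIQUE KEYS!!'] ['US $6,05']

P.S. This price is the incorrect one: 'US $1\xa0200,00'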

【Comments】:

    Tags: python xpath python-3.x lxml


    【Solution 1】:

    Replace the \xa0 characters (non-breaking spaces) in each extracted string:

    priceUSD = [t.replace('\xa0', '') for t in
                doc.xpath('//span[@itemprop="price"]/text()')]
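
To see what the replacement does without fetching the pages, here is a minimal sketch that runs on sample strings copied from the question's output. The `raw_prices` list is an assumption standing in for the result of `doc.xpath(...)`; `unicodedata` is from the standard library and is used only to name the invisible character.

```python
import unicodedata

# Sample strings shaped like the xpath() results quoted in the question.
raw_prices = ['US $1\xa0200,00', 'US $6,05']

# Name every non-ASCII character so the invisible one can be identified.
odd_chars = {unicodedata.name(ch)
             for text in raw_prices
             for ch in text
             if ord(ch) > 127}
print(odd_chars)  # {'NO-BREAK SPACE'}

# Same list comprehension as above, applied to the sample strings.
cleaned = [t.replace('\xa0', '') for t in raw_prices]
print(cleaned)  # ['US $1200,00', 'US $6,05']
```

Only the non-breaking space is removed; the ordinary space after "US" and the decimal comma are left untouched.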
    

    By the way, I get the following output without any modification:

    ['DW/PDP Concept Pearlescent White Maple Drumset'] ['US $1,200.00']
    ['LOT OF 20 DRUM SET TUNING KEYS! DW! TAMA! PEARL! SABIAN! and OTHER UNIQUE KEYS!!'] ['US $6.05']
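
The two runs differ only in their separators ('1\xa0200,00' versus '1,200.00'), so code that needs a number rather than a string has to cope with both. The following is a rough sketch, not something from the thread: `parse_price` is a hypothetical helper, and it assumes a price has either no fractional part or exactly two fractional digits.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Hypothetical helper: convert a scraped price string to a Decimal."""
    # Keep digits and the two possible separators; this drops the
    # currency prefix, ordinary spaces and non-breaking spaces.
    cleaned = re.sub(r'[^\d.,]', '', text)
    whole, frac = cleaned, '00'
    # Assumption: a '.' or ',' in third-to-last position is the
    # decimal separator; any other '.' or ',' groups thousands.
    if len(cleaned) > 2 and cleaned[-3] in '.,':
        whole, frac = cleaned[:-3], cleaned[-2:]
    whole = re.sub(r'[.,]', '', whole)
    return Decimal(f'{whole}.{frac}')


for sample in ['US $1\xa0200,00', 'US $1,200.00', 'US $6,05', 'US $6.05']:
    print(parse_price(sample))
```

Both spellings of the first price come out as 1200.00 and both spellings of the second as 6.05. A price with three fractional digits would break the two-digit assumption, so treat this as a starting point only.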
    

    【Discussion】:

    • '//h1[@id="itemTitle"]/text()' => '//span[@itemprop="price"]/text()'
    • @Sergey, you're right. I copied the wrong line from my local copy. Thanks.