【Question title】: Scrape product specifications from Amazon using Scrapy
【Posted】: 2019-02-08 03:32:39
【Question】:

Hello, I want to scrape the product specification table on the linked product page: https://www.amazon.com/dp/B07HJ41HCF. To do that, I wrote the following spider in Scrapy.

    def parse(self, response):
        item = GraingerItem()
        item['url'] = response.url
        # This is the line that comes back as an empty list.
        item['proddescription'] = response.xpath('//*[@id="productDetails_detailBullets_sections1"]/td[1]/th/text()').extract()
        item['title'] = response.xpath('//*[@id="productTitle"]/text()').extract()[0].strip()
        try:
            item['sellername'] = response.xpath('//*[@id="bylineInfo"]/text()').extract()[0].strip()
        except IndexError:
            item['sellername'] = "No Seller Name"
        gg=[]
        cc= response.xpath('//*[@class="a-link-normal a-color-tertiary"]')
        for bb in cc:
            dd=bb.xpath('text()').extract()[0].strip()
            gg.append(dd)
            gg.append(">")
        qq=str(gg)
        qr=qq.replace("'","")
        qs = qr.replace(">]","")
        qt=qs.replace("[","")
        qu = qt.replace(",","")
        item['travlink'] = qu
        try:
            item['rating'] = response.xpath('//*[@id="acrPopover"]/span[1]/a/i[1]/span/text()').extract()[0].strip()
        except IndexError:
            item['rating'] = "Be the First one to review"
        try:
            item['Crreview'] = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract()[0].strip()
        except IndexError:
            item['Crreview'] = "Be the First one to review"
        dd = response.xpath('//*[@id="feature-bullets"]/ul')
        ft = []
        for i in range(2,40):
            q = str(i)
            trows ="li["+q+"]"
            xpathgiven = trows + "/span/text()"
            for bullets in dd:
                b1= bullets.xpath(xpathgiven).extract()
                for ac in b1:
                    ab = ac.replace("\xa0", "")
                ft.append(b1)
                ft.append(";")
            stft = str(ft)
            stft1 = stft.replace("';', [], ';'","")
            stft2 = stft1.replace("\\t","")
            stft3 = stft2.replace('\\n',"")
            stft4 = stft3.replace("'","")
            stft5 = stft4.replace("[","")
            stft6 = stft5.replace("]","")
            stft7 = stft6.replace(",","")
            item['feature'] = stft7
        description = []
        try:

            for i in range(1, 100):
                q1 = str(i)
                trows1 = "[" + q1 + "]"
                xpathgiven1 = "//*[@id='productDescription']/p/text()["+q1+"]"
                gg = response.xpath(xpathgiven1).extract()
                description.append(gg)
                description.append(";")
            stft = str(description)
            dsft1 = stft.replace("';', [], ';'", "")
            dsft2 = dsft1.replace("'], ';', ['", ";")
            dsft3 = dsft2.replace('\\n', "")
            dsft33 = dsft3.replace('\\t', "")
            dsft4 = dsft33.replace("'", "")
            dsft5 = dsft4.replace("[", "")
            dsft6 = dsft5.replace("]", "")
            dsft7 = dsft6.replace(",", "")
            item['Description'] = dsft7
        except IndexError:
            item['Description'] = "No Description"

Everything in the code above works, except that item['proddescription'] comes back as an empty list. Any help with this would be greatly appreciated.
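As a side note on the spider above, the str()/replace chains used for the breadcrumb and feature fields could probably be replaced by str.join. The following is only a minimal sketch: the helper names join_breadcrumbs and join_bullets are hypothetical, and the output is not byte-for-byte identical to the original (the original chains also strip commas and apostrophes from the scraped text as a side effect).

```python
def join_breadcrumbs(parts):
    """Join breadcrumb labels with ' > ', skipping blank entries."""
    cleaned = (p.strip() for p in parts)
    return " > ".join(p for p in cleaned if p)


def join_bullets(parts):
    """Join feature bullets with ';', replacing non-breaking spaces."""
    cleaned = (p.replace("\xa0", " ").strip() for p in parts)
    return ";".join(p for p in cleaned if p)


# Hypothetical sample values, shaped like what .extract() might return.
breadcrumbs = join_breadcrumbs([" Tools ", "Power Tools\n"])
bullets = join_bullets(["\n\tFast\xa0charge\n", "", "  Light "])
print(breadcrumbs)
print(bullets)
```

Inside parse, the lists would come from calls such as response.xpath('//*[@class="a-link-normal a-color-tertiary"]/text()').extract(), assuming the page markup matches the selectors already used in the question.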

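【Comments】:

  • Have you tried parsing it in the shell? Does your spider return any data?
  • Please see another thread where I explained a similar problem in detail: stackoverflow.com/questions/54471844/…
  • @PS1212 Is there any way around this?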

Tags: python-3.x web-scraping scrapy-spider


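【Solution 1】:

A variant that works for your case. The th and td cells sit inside tr rows, and a th is never a child of a td, so a path ending in /td[1]/th matches nothing; stepping through tr and taking the text of every cell does: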

response.xpath('//*[@id="productDetails_detailBullets_sections1"]/tr/*/text()').re(r'(\w+[^\n]+)')

【Comments】:
