【Question title】:Exporting to CSV format incorrect in scrapy
【Posted】:2015-10-06 18:54:09
【Question】:

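I'm trying to write out a CSV file after scraping, using pipelines, but the formatting is a bit odd: instead of printing top to bottom, it prints everything from page 1 at once, and then everything from page 2, all within a single column. I've attached pipelines.py and one line of the CSV output (it's quite large). So how do I get it to print column-wise, one entry per row, instead of a whole page at once?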

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy import signals
from scrapy.contrib.exporter import CsvItemExporter

class CSVPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline


    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ['names','stars','subjects','reviews']
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()


    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

and output.csv

names   stars   subjects
Vivek0388,NikhilVashisth,DocSharad,Abhimanyu_swarup,Suresh N,kaushalhkapadia,JyotiMallick,Nitin T,mhdMumbai,SunilTukrel(COLUMN 2)   5 of 5 stars,4 of 5 stars,1 of 5 stars,5 of 5 stars,3 of 5 stars,4 of 5 stars,5 of 5 stars,5 of 5 stars,4 of 5 stars,4 of 5 stars(COLUMN 3) Best Stay,Awesome View... Nice Experience!,Highly mismanaged and dishonest.,A Wonderful Experience,Good place with average front office,Honeymoon,Awesome Resort,Amazing,ooty's beauty!!,Good stay and food

It should look like this

Vivek0388      5 of 5
NikhilVashisth 5 of 5
DocSharad      5 of 5
...so on

Edit:

items = [{'reviews:':"",'subjects:':"",'names:':"",'stars:':""} for k in range(1000)]
if(sites and len(sites) > 0):
    for site in sites:
        i+=1
        items[i]['names'] = item['names']
        items[i]['stars'] = item['stars']
        items[i]['subjects'] = item['subjects']
        items[i]['reviews'] = item['reviews']
        yield Request(url="http://tripadvisor.in" + site, callback=self.parse)
    for k in  range(1000):
        yield items[k]

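【Comments】:

  • Forgot to mention that I did change the settings.
  • You do know that your scraper stores all the names as a list in your item? (I remember yesterday's question.) Try splitting each entry into its own item to get the desired result. The same goes for all your fields: each one holds a list of entries.
  • I tried that, but to no avail; all I got was a blank document, because whatever I define in my spider gets called again. But I think I'll export to JSON and then convert that to CSV, since I'm more used to that. Thanks for the help!
  • No problem, but as I said, you should handle these results in the spider itself, and then it will work like a charm.
  • I tried, but I kept getting an error saying I need to return an Item/Field(). I tried returning a dictionary, but got another error. It also didn't work as a recursive call, because that redefines the dictionary and wipes it. But I'll try again and do what you said.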
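
The suggestion in the comments (one item per review instead of one item holding four lists) can be sketched without Scrapy at all. This is only an illustration under assumptions: `split_into_rows` is a hypothetical helper, not part of Scrapy, the sample values are taken from the output shown in the question, and the review texts are placeholders.

```python
def split_into_rows(scraped, fields=("names", "stars", "subjects", "reviews")):
    """Yield one dict per review from a dict whose values are parallel lists."""
    columns = [scraped[f] for f in fields]
    # zip(*columns) pairs up the n-th entry of every list
    for values in zip(*columns):
        yield dict(zip(fields, values))


# One scraped page, shaped like the asker's item: every field is a list.
page = {
    "names": ["Vivek0388", "NikhilVashisth", "DocSharad"],
    "stars": ["5 of 5 stars", "4 of 5 stars", "1 of 5 stars"],
    "subjects": ["Best Stay", "Awesome View... Nice Experience!",
                 "Highly mismanaged and dishonest."],
    "reviews": ["review text 1", "review text 2", "review text 3"],  # placeholders
}

rows = list(split_into_rows(page))
for row in rows:
    print(row["names"], row["stars"])
```

In a spider, each of these dicts would be yielded from the parse callback in place of the single list-valued item. Scrapy 1.0 and later accept plain dicts from callbacks; on older versions each dict would presumably have to be wrapped in an Item subclass, which matches the error described above.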

Tags: python csv web-scraping scrapy export-to-csv


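【Solution 1】:

Figured it out: zip the lists together with csv, then loop over the result and write each row. This was much less complicated once I read the docs.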

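import csv

class CSVPipeline(object):

   def __init__(self):
      # Python 3: text mode with newline='' (the original 'wb' mode only works on Python 2)
      self.file = open('items.csv', 'w', newline='')
      self.csvwriter = csv.writer(self.file, delimiter=',')
      self.csvwriter.writerow(['names','stars','subjects','reviews'])

   def process_item(self, item, spider):

      # Every field holds a list; zip pairs up the n-th entry of each list into one row
      rows = zip(item['names'],item['stars'],item['subjects'],item['reviews'])

      for row in rows:
         self.csvwriter.writerow(row)

      return item

   def close_spider(self, spider):
      # Close the file when the crawl ends so buffered rows are flushed
      self.file.close()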
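
One thing to watch with the zip approach: `zip` stops at the shortest list, so if a page yields fewer stars than names, the trailing names are silently dropped. A minimal sketch of the difference, assuming Python 3 and writing to an in-memory buffer instead of a file; `write_rows` is a hypothetical helper and the shortened `stars` list is made up for the demonstration.

```python
import csv
import io
from itertools import zip_longest

item = {
    "names": ["Vivek0388", "NikhilVashisth", "DocSharad"],
    "stars": ["5 of 5 stars", "4 of 5 stars"],  # one entry deliberately missing
}


def write_rows(item, fields, pad=False):
    """Write the parallel lists in `item` as CSV rows and return the text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(fields)
    columns = [item[f] for f in fields]
    # zip truncates to the shortest list; zip_longest pads the gaps instead
    rows = zip_longest(*columns, fillvalue="") if pad else zip(*columns)
    writer.writerows(rows)
    return buf.getvalue()


truncated = write_rows(item, ["names", "stars"])
padded = write_rows(item, ["names", "stars"], pad=True)
print(truncated)
print(padded)
```

Whether padding or truncating is the right choice depends on the page: a padded row keeps the name but may misalign fields if the missing entry was not the last one.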

