Split a large csv file through python: answers

【Question title】: Split a large csv file through python
【Posted】: 2017-10-10 08:36:23
【Question description】:

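I have a csv file with 5 million rows. I want to split it into files that each contain a user-specified number of rows.

I have written the following code, but it takes too long to run. Can anyone help me optimize it?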

import csv
print "Please delete the previous created files. If any."

filepath = raw_input("Enter the File path: ")

line_count = 0
filenum = 1
try:
    in_file = raw_input("Enter Input File name: ")
    if in_file[-4:] == ".csv":
        split_size = int(raw_input("Enter size: "))
        print "Split Size ---", split_size
        print in_file, " will split into", split_size, "rows per file named as OutPut-file_*.csv (* = 1,2,3 and so on)"
        with open (in_file,'r') as file1:
            row_count = 0
            reader = csv.reader(file1)
            for line in file1:
                #print line
                with open(filepath + "\\OutPut-file_" +str(filenum) + ".csv", "a") as out_file:
                    if row_count < split_size:
                        out_file.write(line)
                        row_count = row_count +1
                    else:
                        filenum = filenum + 1
                        row_count = 0
                line_count = line_count+1
        print "Total Files Written --", filenum
    else:
        print "Please enter the Name of the file correctly."
except IOError as e:
   print "Oops..! Please Enter correct file path values", e
except  ValueError:
   print "Oops..! Please Enter correct values"

I have also tried it without "with open".

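【Question comments】:

How about a more conventional unit than lakhs (hundreds of thousands)? ;)
What about seeking to different points with different file pointers and using them in parallel through co-routines/gevent?
I haven't tried that yet. Can you help? Multithreading or multitasking would help here.
Is there some reason you can't remove your Indian words?
@JamesZ Indian words like what??

Tags: python csv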

【Solution 1】:

Ouch! You are re-opening the output file for every single line, and opening a file is an expensive operation... Your code could become:

    ...
    with open (in_file,'r') as file1:
        row_count = 0
        #reader = csv.reader(file1)   # unused here
        out_file = open(filepath + "\\OutPut-file_" +str(filenum) + ".csv", "a")
        for line in file1:
            #print line
            if row_count >= split_size:
                out_file.close()
                filenum = filenum + 1
                out_file = open(filepath + "\\OutPut-file_" +str(filenum) + ".csv", "a")
                row_count = 0
            out_file.write(line)
            row_count = row_count +1
            line_count = line_count+1
        ...

Ideally, you should even initialize out_file = None before the try block, and make sure the file is closed cleanly in the except blocks with if out_file is not None: out_file.close()
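As a rough sketch of that advice, the output handle can start as None and be closed in a finally block, which covers both the normal and the error path. This is written for Python 3 (the code above is Python 2), and the function name `split_by_lines` and its parameters are made up for illustration; they are not part of the original code.

```python
import os


def split_by_lines(in_path, out_dir, split_size):
    """Copy in_path into numbered files of at most split_size lines each.

    Returns the number of files written. Splits on physical lines only,
    so quoted fields that contain newlines are not handled.
    """
    out_file = None  # stays None until the first line is read
    filenum = 0
    row_count = 0
    try:
        # newline="" keeps the original line endings untouched.
        with open(in_path, "r", newline="") as src:
            for line in src:
                # Open the next output file only when one is needed.
                if out_file is None or row_count >= split_size:
                    if out_file is not None:
                        out_file.close()
                    filenum += 1
                    name = "OutPut-file_%d.csv" % filenum
                    out_file = open(os.path.join(out_dir, name), "w", newline="")
                    row_count = 0
                out_file.write(line)
                row_count += 1
    finally:
        # Runs on success and on error, so the last handle is always closed.
        if out_file is not None:
            out_file.close()
    return filenum
```

Each output file is opened once rather than once per line, and an empty input produces no output files at all.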

Remark: this code only splits by line count (as yours does). That means it will give wrong output if the csv file can contain newlines inside quoted fields...

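【Comments】:

Oh.. in that case.. I need to check for new lines. Right?
@user2597209: if you want to allow newlines inside quoted fields, you will have to parse the input file with a csv reader and write the rows with a csv writer, or do the parsing by hand, but that is complex and has many corner cases.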
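A minimal sketch of the csv reader/writer approach mentioned in that comment, assuming Python 3 and no special treatment of a header row. The function name `split_csv_rows` and its parameters are hypothetical, chosen only for this example.

```python
import csv
import os


def split_csv_rows(in_path, out_dir, rows_per_file):
    """Split in_path into files of at most rows_per_file parsed CSV rows.

    Returns the number of files written. No header handling is attempted.
    """
    out_file = None
    writer = None
    filenum = 0
    row_count = 0
    try:
        # newline="" lets the csv module see embedded newlines unchanged.
        with open(in_path, "r", newline="") as src:
            for row in csv.reader(src):
                if out_file is None or row_count >= rows_per_file:
                    if out_file is not None:
                        out_file.close()
                    filenum += 1
                    name = "OutPut-file_%d.csv" % filenum
                    out_file = open(os.path.join(out_dir, name), "w", newline="")
                    writer = csv.writer(out_file)
                    row_count = 0
                writer.writerow(row)
                row_count += 1
    finally:
        if out_file is not None:
            out_file.close()
    return filenum
```

Because rows are parsed and then re-serialized, a record whose quoted field spans several physical lines stays in one output file. The trade-off is that quoting and line endings in the output follow the csv writer's defaults and may differ from the input, and parsing is likely slower than copying raw lines.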

【Solution 2】:

You can definitely use Python's multiprocessing module.

Here are the results I got with a csv file of 1,000,000 lines.

import time
from multiprocessing import Pool

def saving_csv_normally(start):
  out_file = open('out_normally/' + str(start/batch_size) + '.csv', 'w')
  for i in range(start, start+batch_size):
    out_file.write(arr[i])
  out_file.close()

def saving_csv_multi(start):
  out_file = open('out_multi/' + str(start/batch_size) + '.csv', 'w')
  for i in range(start, start+batch_size):
    out_file.write(arr[i])
  out_file.close()

def saving_csv_multi_async(start):
  out_file = open('out_multi_async/' + str(start/batch_size) + '.csv', 'w')
  for i in range(start, start+batch_size):
    out_file.write(arr[i])
  out_file.close()

with open('files/test.csv') as file:
  arr = file.readlines()

print "length of file : ", len(arr)

batch_size = 100 #split in number of rows

start = time.time()
for i in range(0, len(arr), batch_size):
  saving_csv_normally(i)
print "time taken normally : ", time.time()-start

#multiprocessing
p = Pool()
start = time.time()
p.map(saving_csv_multi, range(0, len(arr), batch_size), chunksize=len(arr)/4) #chunksize you can define as much as you want
print "time taken for multiprocessing : ", time.time()-start

# it does the same thing asynchronously
start = time.time()
for i in p.imap_unordered(saving_csv_multi_async, range(0, len(arr), batch_size), chunksize=len(arr)/4): 
  continue
print "time taken for multiprocessing async : ", time.time()-start

The output shows the time taken by each approach:

length of file :  1000000
time taken normally :  0.733881950378
time taken for multiprocessing :  0.508712053299
time taken for multiprocessing async :  0.471592903137

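I defined three separate functions because the function passed to p.map can only take one argument, and I store the csv files in three different folders, which is why I wrote three functions.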
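One possible way around the one-argument limit, sketched here for Python 3, is to bind the output folder with `functools.partial` so that a single worker function serves every folder. The names `write_batch`, `make_tasks` and `split_parallel` are hypothetical and not taken from the answer. Unlike the code above, each task carries its own batch of lines instead of relying on a global `arr`.

```python
import os
from functools import partial
from multiprocessing import Pool


def write_batch(out_dir, task):
    """Write one batch of lines to <out_dir>/<index>.csv and return the path."""
    index, lines = task
    path = os.path.join(out_dir, "%d.csv" % index)
    with open(path, "w", newline="") as out_file:
        out_file.writelines(lines)
    return path


def make_tasks(lines, batch_size):
    """Pair each batch of lines with its batch index."""
    return [(start // batch_size, lines[start:start + batch_size])
            for start in range(0, len(lines), batch_size)]


def split_parallel(lines, batch_size, out_dir, processes=None):
    """Write all batches to out_dir using a pool of worker processes."""
    worker = partial(write_batch, out_dir)  # binds out_dir, leaves one argument
    with Pool(processes) as pool:
        return pool.map(worker, make_tasks(lines, batch_size))
```

On platforms that start workers by spawning a fresh interpreter, `split_parallel` should be called from under an `if __name__ == "__main__":` guard. Whether this beats a plain loop depends heavily on the disk and on the cost of sending each batch to a worker, so it is worth timing on the real data.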

【Comments】: