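[Posted]: 2018-04-01 13:10:03
[Question]:
I'm a Python beginner and have written a few basic scripts. My latest challenge is to take a very large CSV file (10 GB+) and split it into several smaller files based on the value of a particular variable in each row.
For example, the file might look like this: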
Category,Title,Sales
"Books","Harry Potter",1441556
"Books","Lord of the Rings",14251154
"Series", "Breaking Bad",6246234
"Books","The Alchemist",12562166
"Movie","Inception",1573437
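I want to split the file into separate files: Books.csv, Series.csv, Movie.csv
In reality there will be hundreds of categories, and they will not be sorted. In this case they are in the first column, but in future they may not be.
I've found some solutions online, but none in Python. There is a very simple AWK command that does this in one line, but I don't have access to AWK at work.
I wrote the following code, which works, but I think it is probably very inefficient. Can anyone suggest how to speed it up?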
import csv
#Creates empty set - this will be used to store the values that have already been used
filelist = set()
#Opens the large csv file in "read" mode
with open('//directory/largefile', 'r') as csvfile:
    #Read the first row of the large file and store the whole row as a string (headerstring)
    read_rows = csv.reader(csvfile)
    headerrow = next(read_rows)
    headerstring=','.join(headerrow)
    for row in read_rows:
        #Store the whole row as a string (rowstring)
        rowstring=','.join(row)
        #Defines filename as the first entry in the row - This could be made dynamic so that the user inputs a column name to use
        filename = (row[0])
        #This basically makes sure it is not looking at the header row.
        if filename != "Category":
            #If the filename is not in the filelist set, add it to the set and create a new csv file with the header row.
            if filename not in filelist:
                filelist.add(filename)
                with open('//directory/subfiles/' +str(filename)+'.csv','a') as f:
                    f.write(headerstring)
                    f.write("\n")
            #Append the current row to the csv file for its category (also for the first row of a new category, so it is not lost).
            #The "with" block closes the file, so no explicit f.close() is needed.
            with open('//directory/subfiles/' +str(filename)+'.csv','a') as f:
                f.write(rowstring)
                f.write("\n")
Thanks!
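For comparison, here is a minimal sketch of one way to avoid reopening an output file for every row: keep one csv.writer per category open for the whole pass. The function name split_csv_by_column and its parameters are hypothetical. The sketch assumes that category values are safe to use as file names and that the number of categories stays below the operating system's open-file limit.

```python
import csv
import os
from contextlib import ExitStack


def split_csv_by_column(src_path, out_dir, column="Category"):
    """Write each row of src_path to out_dir/<value of column>.csv.

    Hypothetical helper for illustration; returns the sorted category values.
    """
    os.makedirs(out_dir, exist_ok=True)
    writers = {}  # category value -> csv.writer for that category's file
    with ExitStack() as stack, open(src_path, newline="") as src:
        # skipinitialspace tolerates rows such as: "Series", "Breaking Bad",...
        reader = csv.reader(src, skipinitialspace=True)
        header = next(reader)
        key_index = header.index(column)  # look the column up by name
        for row in reader:
            if not row:
                continue  # csv.reader yields [] for blank lines
            key = row[key_index]
            writer = writers.get(key)
            if writer is None:
                # First time this category appears: open its file once,
                # keep it open until the ExitStack exits, write the header.
                path = os.path.join(out_dir, key + ".csv")
                out = stack.enter_context(open(path, "w", newline=""))
                writer = writers[key] = csv.writer(out)
                writer.writerow(header)
            writer.writerow(row)
    return sorted(writers)
```

Each output file is opened exactly once, and the input is streamed row by row, so memory use does not grow with the size of the file. Passing a different column name covers the case where the category is not in the first column.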
[Comments]:
- Why not use pandas?
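A rough sketch of what the pandas route could look like, reading the file in chunks so that the whole 10 GB+ file never has to fit in memory. The function name split_csv_with_pandas, the chunksize value, and the output directory are assumptions for illustration. It also assumes the output directory starts empty, because the per-category files are opened in append mode.

```python
import os

import pandas as pd


def split_csv_with_pandas(src_path, out_dir, column="Category", chunksize=100_000):
    """Append each chunk's rows to out_dir/<value of column>.csv.

    Hypothetical helper for illustration; returns the sorted category values.
    """
    os.makedirs(out_dir, exist_ok=True)
    seen = set()  # categories whose file already has a header row
    # dtype=str keeps every value as text, so numbers are not reformatted.
    chunks = pd.read_csv(src_path, chunksize=chunksize,
                         skipinitialspace=True, dtype=str)
    for chunk in chunks:
        # Rows whose category is empty (NaN) are dropped by groupby by default.
        for key, group in chunk.groupby(column, sort=False):
            path = os.path.join(out_dir, f"{key}.csv")
            # Write the header only the first time a category is seen.
            group.to_csv(path, mode="a", index=False, header=key not in seen)
            seen.add(key)
    return sorted(seen)
```

Whether this is faster than a plain csv-module loop depends on the data. A larger chunksize means fewer file opens per category but more memory per chunk.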