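[Posted]: 2016-04-19 10:00:59
[Question]:
I am working with 2 large dataset files for a project. I managed to clean the files line by line. However, when I try to apply the same logic to merge the 2 files on a common column, it fails. The problem is that the second loop runs to completion and only then does the top loop run (I don't know why this happens). I first tried using numpy: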
import numpy as np

buys = np.genfromtxt('buys_dtsep.dat', delimiter=",", dtype='str')
clicks = np.genfromtxt('clicks_dtsep.dat', delimiter=",", dtype='str')
f = open('combined.dat', 'w')
for s in clicks:
    for s2 in buys:
        pass  # process data
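But because of memory limits, and the time it takes to load the data into an array and then process it, loading a file with 33 million entries into an array is not feasible. So I am trying to process the files line by line to avoid running out of memory.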
import csv

buys = open('buys_dtsep.dat')
clicks = open('clicks_dtsep.dat')
f = open('combined.dat', 'w')
csv_buys = csv.reader(buys)
csv_clicks = csv.reader(clicks)
for s in csv_clicks:
    print('file 1 row x')  # to check when the outer loop runs
    for s2 in csv_buys:
        print(s2[0])  # check the looped data
        # do merge op
The printed output should be
file 1 row 0
file 2 row 0
...
file 2 row x
file 1 row 1
and so on
The output I get is
file 2 row 0
file 2 row 1
...
file 2 row x
file 1 row 0
...
file 1 row z
If the loop problem above can be solved, I should be able to merge the files line by line.
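One likely explanation for the loop order, assuming the first "file 1" line simply scrolled out of view: a csv.reader is a one-shot iterator, so the inner reader is used up during the first outer iteration and yields nothing afterwards. The sketch below reproduces that with tiny made-up in-memory rows (not the real data) and shows that rewinding the inner file restores the expected nesting. The function names are hypothetical.

```python
import csv
import io


def nested_pairs_broken(clicks_file, buys_file):
    """Reproduce the problem: the inner reader is consumed on the first outer pass."""
    pairs = []
    csv_buys = csv.reader(buys_file)
    for click in csv.reader(clicks_file):
        for buy in csv_buys:  # yields rows only during the first outer iteration
            pairs.append((click[0], buy[0]))
    return pairs


def nested_pairs_rewound(clicks_file, buys_file):
    """Rewind the buys file before each inner pass so every click sees every buy."""
    pairs = []
    for click in csv.reader(clicks_file):
        buys_file.seek(0)  # go back to the start of the buys file
        for buy in csv.reader(buys_file):
            pairs.append((click[0], buy[0]))
    return pairs


# Made-up stand-ins for the two files.
clicks_text = "c1,x\nc2,y\n"
buys_text = "b1,p\nb2,q\n"

broken = nested_pairs_broken(io.StringIO(clicks_text), io.StringIO(buys_text))
fixed = nested_pairs_rewound(io.StringIO(clicks_text), io.StringIO(buys_text))
print(broken)
print(fixed)
```

Rewinding only fixes the loop order. A full nested scan still reads the inner file once per outer row, which is probably too slow for files with tens of millions of rows.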
Update: sample data
Sample from the buys file
420374,2014-04-06,18:44:58.314,214537888,12462,1
420374,2014-04-06,18:44:58.325,214537850,10471,1
281626,2014-04-06,09:40:13.032,214535653,1883,1
420368,2014-04-04,06:13:28.848,214530572,6073,1
420368,2014-04-04,06:13:28.858,214835025,2617,1
140806,2014-04-07,09:22:28.132,214668193,523,1
140806,2014-04-07,09:22:28.176,214587399,1046,1
Sample from the clicks file
420374,2014-04-06,18:44:58,214537888,0
420374,2014-04-06,18:41:50,214537888,0
420374,2014-04-06,18:42:33,214537850,0
420374,2014-04-06,18:42:38,214537850,0
420374,2014-04-06,18:43:02,214537888,0
420374,2014-04-06,18:43:10,214537888,0
420369,2014-04-07,19:39:43,214839373,0
420369,2014-04-07,19:39:56,214684513,0
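Given rows shaped like these samples, one way to merge without a nested scan is a dictionary lookup: index one file by the shared key, then stream the other file once. This is only a sketch under two assumptions that the post does not confirm: that the first column is the common column, and that the buys file is small enough to index in memory. The name merge_on_key and the key_col parameter are hypothetical.

```python
import csv
import io
from collections import defaultdict


def merge_on_key(clicks_file, buys_file, out_file, key_col=0):
    """Index the buys by key, then stream the clicks once and write matched rows."""
    buys_by_key = defaultdict(list)
    for buy in csv.reader(buys_file):
        buys_by_key[buy[key_col]].append(buy)

    writer = csv.writer(out_file)
    written = 0
    for click in csv.reader(clicks_file):
        # .get avoids creating empty entries for keys that have no buys
        for buy in buys_by_key.get(click[key_col], []):
            writer.writerow(click + buy)  # click columns followed by buy columns
            written += 1
    return written


# Two rows from each sample above, held in memory instead of on disk.
buys_sample = (
    "420374,2014-04-06,18:44:58.314,214537888,12462,1\n"
    "281626,2014-04-06,09:40:13.032,214535653,1883,1\n"
)
clicks_sample = (
    "420374,2014-04-06,18:44:58,214537888,0\n"
    "420369,2014-04-07,19:39:43,214839373,0\n"
)
out = io.StringIO()
count = merge_on_key(io.StringIO(clicks_sample), io.StringIO(buys_sample), out)
print(count)
print(out.getvalue())
```

With real files, the three arguments would be open file handles such as open('clicks_dtsep.dat'), open('buys_dtsep.dat') and open('combined.dat', 'w'). Only the indexed file has to fit in memory; the other one is read a row at a time.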
[Comments]:
- Can you use pandas? If so, consider read_csv and its chunksize parameter.
- Can you add some sample rows from both .dat files?
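The pandas suggestion in the first comment could look roughly like the sketch below: read_csv with chunksize returns an iterator of DataFrames, so the clicks file is processed a block at a time while the buys table stays in memory. The column names, the choice of the first column as the join key, and the function name merge_in_chunks are all assumptions made for illustration; the files in the question have no header row.

```python
import io

import pandas as pd

# Hypothetical column names; the .dat files themselves carry no header.
CLICK_COLS = ["session_id", "date", "time", "item_id", "category"]
BUY_COLS = ["session_id", "date", "time", "item_id", "price", "quantity"]


def merge_in_chunks(clicks_src, buys_src, out, chunksize=100_000):
    """Merge clicks into buys on session_id, reading the clicks in chunks."""
    buys = pd.read_csv(buys_src, header=None, names=BUY_COLS)
    total = 0
    reader = pd.read_csv(clicks_src, header=None, names=CLICK_COLS, chunksize=chunksize)
    for chunk in reader:
        # Inner join: keep only clicks whose session also appears in buys.
        merged = chunk.merge(buys, on="session_id", suffixes=("_click", "_buy"))
        merged.to_csv(out, header=False, index=False)  # out is an open handle
        total += len(merged)
    return total


# Two rows from each sample in the question, held in memory.
buys_sample = (
    "420374,2014-04-06,18:44:58.314,214537888,12462,1\n"
    "420374,2014-04-06,18:44:58.325,214537850,10471,1\n"
)
clicks_sample = (
    "420374,2014-04-06,18:44:58,214537888,0\n"
    "420369,2014-04-07,19:39:43,214839373,0\n"
)
out = io.StringIO()
total = merge_in_chunks(io.StringIO(clicks_sample), io.StringIO(buys_sample), out, chunksize=1)
print(total)
```

This still assumes the buys table fits in memory. If neither file does, sorting both on the key and merging them in a single pass would be a different approach.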
Tags: python csv numpy memory-management