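【Posted】: 2018-05-24 16:25:01
【Question】:
I have a large dataset of 40 million records, about 21.0 GB in total, stored in MongoDB. It took me several hours to load it into a pandas DataFrame, and total memory usage grew to about 28.7 GB (it was about 600 MB before loading).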
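    import pandas as pd

    # mongocollection is the MongoDB collection object holding the records
    cursor = mongocollection.find()
    data = pd.DataFrame()
    count = 0
    dataset = []
    for i in cursor:
        dataset.append(i)
        del i
        count += 1
        if count % 100000 == 0:
            print(count)
            # Convert the current batch of 100,000 documents and append it.
            # Note: DataFrame.append was removed in pandas 2.0.
            temp = pd.DataFrame(dataset, columns=dataset[0].keys())
            dataset = []
            data = data.append(temp)

    # Convert whatever is left after the last full batch (may be empty)
    if dataset:
        temp = pd.DataFrame(dataset, columns=dataset[0].keys())
        dataset = []
        data = data.append(temp)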
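Worried about how long it takes to load a dataset like this, I saved it to local disk with data.to_csv('localdisk.csv'). The CSV file is 7.1 GB.
So the question is: why is the CSV file so small while the memory used by the DataFrame (or by something else?) is about 4 times larger, and is there a better way to reduce the DataFrame's memory usage? I have another dataset with more than 100 million items of the same kind, and I wonder whether I could load it into memory with such a solution.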
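Since the data has already been written to CSV, one option for reloading it (a sketch only, not something measured on the dataset in question) is to declare compact dtypes and skip unneeded columns while reading the file back. The column names and the in-memory CSV text below are hypothetical stand-ins for localdisk.csv:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'localdisk.csv'.
csv_text = (
    "somecount,unused,somecate\n"
    "1,x,a\n"
    "2,y,b\n"
    "3,z,a\n"
)

# Declaring dtypes at read time avoids creating int64/object columns first.
dtypes = {"somecount": "int32", "somecate": "category"}

data = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["somecount", "somecate"],  # columns left out are never loaded
    dtype=dtypes,
)
print(data.dtypes)
```

Whether this is enough for the 100-million-item dataset depends on how many columns can be expressed as small integers or categoricals.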
PS: I think loading the data into memory takes so long because of these three statements:
    temp = pd.DataFrame(dataset, columns=dataset[0].keys())
    dataset = []
    data = data.append(temp)
With 60,000 items in dataset, loading them into data (a pandas DataFrame) takes about 5-10 minutes.
    > data.memory_usage(index=True).sum()
    6451973127 bytes # About 6G, close to the size of the CSV file.
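One possible reason this figure sits close to the CSV size while the process uses far more: by default, memory_usage counts object columns as 8-byte pointers and ignores the Python objects they point to. Passing deep=True includes them. A minimal sketch with made-up data (the column names are hypothetical, and the string column is forced to object dtype to mimic what loading raw documents typically produces):

```python
import pandas as pd

# Hypothetical toy data: one object (string) column and one integer column.
df = pd.DataFrame({
    "name": pd.Series(["alpha", "beta", "gamma"] * 1000, dtype=object),
    "n": range(3000),
})

# Shallow: object columns are counted as 8-byte pointers only.
shallow = df.memory_usage(index=True).sum()

# Deep: also counts the Python string objects behind those pointers.
deep = df.memory_usage(index=True, deep=True).sum()

print("shallow bytes:", shallow)
print("deep bytes:   ", deep)
```

The gap between the two numbers is the per-object overhead of the strings, which a CSV file does not pay.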
Update:
Code that generates the metrics
This SO answer says that concat is more efficient than append. I haven't tested that yet.
    import time
    import psutil

    last_time = time.time()
    for i in cursor:
        dataset.append(i)
        del i
        count += 1
        if count % 100000 == 0:
            temp = pd.DataFrame(dataset, columns=dataset[0].keys())
            dataset = []
            data = pd.concat([data, temp])
            # Time spent on this batch and system-wide memory in use (GiB)
            current_time = time.time()
            cost_time = current_time - last_time
            last_time = current_time
            memory_usage = psutil.virtual_memory().used / (1024**3)
            print("count is {}, cost time is {}, memory usage is {}".format(count, cost_time, memory_usage))
Metrics for loading the data into the DataFrame
count is 100000, cost time is 12.346338033676147, memory usage is 0.7630538940429688
count is 200000, cost time is 8.272525310516357, memory usage is 0.806121826171875
count is 300000, cost time is 10.19885516166687, memory usage is 0.9408340454101562
count is 400000, cost time is 6.370742082595825, memory usage is 0.9675140380859375
count is 500000, cost time is 7.93895959854126, memory usage is 0.9923629760742188
count is 600000, cost time is 12.54422402381897, memory usage is 1.1193618774414062
count is 700000, cost time is 9.631025552749634, memory usage is 1.1592445373535156
count is 800000, cost time is 7.459330081939697, memory usage is 1.1680374145507812
count is 900000, cost time is 9.528786659240723, memory usage is 1.2807159423828125
count is 1000000, cost time is 7.681959867477417, memory usage is 1.2977333068847656
count is 1100000, cost time is 7.3086090087890625, memory usage is 1.3396949768066406
count is 1200000, cost time is 11.282068252563477, memory usage is 1.4544296264648438
count is 1300000, cost time is 9.21155858039856, memory usage is 1.4788284301757812
count is 1400000, cost time is 10.056787014007568, memory usage is 1.5263175964355469
count is 1500000, cost time is 12.212023973464966, memory usage is 1.6380157470703125
count is 1600000, cost time is 14.238991260528564, memory usage is 1.69512939453125
count is 1700000, cost time is 8.800130128860474, memory usage is 1.7134437561035156
count is 1800000, cost time is 11.374922275543213, memory usage is 1.8270645141601562
count is 1900000, cost time is 8.9767906665802, memory usage is 1.8472061157226562
count is 2000000, cost time is 8.989881992340088, memory usage is 1.8804588317871094
count is 2100000, cost time is 11.93136477470398, memory usage is 2.000476837158203
count is 2200000, cost time is 11.224282264709473, memory usage is 2.016876220703125
count is 2300000, cost time is 13.535449266433716, memory usage is 2.0394668579101562
count is 2400000, cost time is 12.848443269729614, memory usage is 2.1280059814453125
count is 2500000, cost time is 12.208937883377075, memory usage is 2.138408660888672
count is 2600000, cost time is 16.975553512573242, memory usage is 2.2880821228027344
count is 2700000, cost time is 19.275086879730225, memory usage is 2.287738800048828
count is 2800000, cost time is 11.386988639831543, memory usage is 2.3098106384277344
count is 2900000, cost time is 13.70014500617981, memory usage is 2.3990440368652344
count is 3000000, cost time is 10.45867395401001, memory usage is 2.420604705810547
count is 3100000, cost time is 10.75408387184143, memory usage is 2.4437637329101562
count is 3200000, cost time is 15.346243619918823, memory usage is 2.5608978271484375
count is 3300000, cost time is 12.275937795639038, memory usage is 2.5855789184570312
count is 3400000, cost time is 11.398426532745361, memory usage is 2.6102142333984375
count is 3500000, cost time is 17.990268230438232, memory usage is 2.7031402587890625
count is 3600000, cost time is 11.90847396850586, memory usage is 2.724163055419922
count is 3700000, cost time is 14.961709260940552, memory usage is 2.8711891174316406
count is 3800000, cost time is 13.13991904258728, memory usage is 2.8688430786132812
count is 3900000, cost time is 12.900552749633789, memory usage is 2.8935928344726562
count is 4000000, cost time is 15.278205633163452, memory usage is 3.01715087890625
count is 4100000, cost time is 12.421746492385864, memory usage is 3.044261932373047
count is 4200000, cost time is 12.715410232543945, memory usage is 3.1170883178710938
count is 4300000, cost time is 15.297654867172241, memory usage is 3.195178985595703
count is 4400000, cost time is 11.920997858047485, memory usage is 3.2213592529296875
count is 4500000, cost time is 12.397282123565674, memory usage is 3.2494659423828125
count is 4600000, cost time is 13.162795305252075, memory usage is 3.3564605712890625
count is 4700000, cost time is 14.042455434799194, memory usage is 3.413494110107422
count is 4800000, cost time is 10.402931451797485, memory usage is 3.3945388793945312
count is 4900000, cost time is 13.326395034790039, memory usage is 3.4888954162597656
count is 5000000, cost time is 11.762998580932617, memory usage is 3.5169677734375
count is 5100000, cost time is 13.566682577133179, memory usage is 3.610504150390625
count is 5200000, cost time is 11.697095155715942, memory usage is 3.637969970703125
count is 5300000, cost time is 11.785945415496826, memory usage is 3.702167510986328
count is 5400000, cost time is 20.747815132141113, memory usage is 3.7620506286621094
count is 5500000, cost time is 12.001267910003662, memory usage is 3.788776397705078
count is 5600000, cost time is 12.201840877532959, memory usage is 3.8513031005859375
count is 5700000, cost time is 16.82955837249756, memory usage is 3.9653396606445312
count is 5800000, cost time is 12.35794973373413, memory usage is 3.9715538024902344
count is 5900000, cost time is 12.41870403289795, memory usage is 3.999217987060547
count is 6000000, cost time is 14.590713024139404, memory usage is 4.0941619873046875
count is 6100000, cost time is 13.40040898323059, memory usage is 4.119499206542969
count is 6200000, cost time is 15.54603385925293, memory usage is 4.2159881591796875
count is 6300000, cost time is 12.232314348220825, memory usage is 4.2417449951171875
count is 6400000, cost time is 12.939337491989136, memory usage is 4.268760681152344
count is 6500000, cost time is 15.472190856933594, memory usage is 4.371849060058594
count is 6600000, cost time is 13.525130987167358, memory usage is 4.392463684082031
count is 6700000, cost time is 13.798184633255005, memory usage is 4.467185974121094
count is 6800000, cost time is 16.133020877838135, memory usage is 4.513973236083984
count is 6900000, cost time is 20.654539108276367, memory usage is 4.537406921386719
count is 7000000, cost time is 15.181331872940063, memory usage is 4.617683410644531
count is 7100000, cost time is 16.90074348449707, memory usage is 4.6607208251953125
count is 7200000, cost time is 15.26277780532837, memory usage is 4.6886749267578125
count is 7300000, cost time is 13.590909719467163, memory usage is 4.7701873779296875
count is 7400000, cost time is 17.623094081878662, memory usage is 4.812957763671875
count is 7500000, cost time is 14.904731035232544, memory usage is 4.8453521728515625
count is 7600000, cost time is 16.52383327484131, memory usage is 4.992897033691406
count is 7700000, cost time is 14.730050325393677, memory usage is 4.961498260498047
count is 7800000, cost time is 14.83224892616272, memory usage is 4.986965179443359
count is 7900000, cost time is 16.819100856781006, memory usage is 5.141094207763672
count is 8000000, cost time is 16.299737691879272, memory usage is 5.108722686767578
count is 8100000, cost time is 15.587513208389282, memory usage is 5.14031982421875
count is 8200000, cost time is 19.151288747787476, memory usage is 5.296863555908203
count is 8300000, cost time is 15.674288511276245, memory usage is 5.3394622802734375
count is 8400000, cost time is 16.563526153564453, memory usage is 5.292533874511719
count is 8500000, cost time is 20.42433261871338, memory usage is 5.447917938232422
count is 8600000, cost time is 15.694331884384155, memory usage is 5.412452697753906
count is 8700000, cost time is 20.2867329120636, memory usage is 5.571533203125
count is 8800000, cost time is 18.203043222427368, memory usage is 5.532035827636719
count is 8900000, cost time is 16.625596523284912, memory usage is 5.628833770751953
count is 9000000, cost time is 23.0804705619812, memory usage is 5.652252197265625
count is 9100000, cost time is 17.696472883224487, memory usage is 5.745880126953125
count is 9200000, cost time is 15.72276496887207, memory usage is 5.705802917480469
Update 2:
Code that normalizes the data (small integers and categoricals)
    last_time = time.time()
    # Names are shown as posted; the repeated keys look like anonymized placeholders.
    dtypes = {"somecount":'int32',"somecount":"int32","somecate":"category","somecount":"int32","somecate":"category","somecount":"int32","somecount":"int32","somecate":"category"}
    for i in cursor:
        # Drop fields that are not needed before buffering the document
        del i['something']
        del i['sometime']
        del i['something']
        del i['something']
        del i['someint']
        dataset.append(i)
        del i
        count += 1
        if count % 100000 == 0:
            temp = pd.DataFrame(dataset, columns=dataset[0].keys())
            temp.fillna(0, inplace=True)
            # Downcast counts to int32 and convert labels to category
            temp = temp.astype(dtypes, errors="ignore")
            dataset = []
            data = pd.concat([data, temp])
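Metrics after optimization:
Total memory usage is almost half of that of the first approach above, but the time spent concatenating/appending has not changed much.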
data length is 37800000,count is 37800000, cost time is 132.23220038414001, memory usage is 11.789329528808594
data length is 37900000,count is 37900000, cost time is 65.34806060791016, memory usage is 11.7882080078125
data length is 38000000,count is 38000000, cost time is 122.15527963638306, memory usage is 11.804153442382812
data length is 38100000,count is 38100000, cost time is 47.79928374290466, memory usage is 11.828723907470703
data length is 38200000,count is 38200000, cost time is 49.70282459259033, memory usage is 11.837543487548828
data length is 38300000,count is 38300000, cost time is 155.42868423461914, memory usage is 11.895767211914062
data length is 38400000,count is 38400000, cost time is 105.94551157951355, memory usage is 11.947330474853516
data length is 38500000,count is 38500000, cost time is 136.1993544101715, memory usage is 12.013351440429688
data length is 38600000,count is 38600000, cost time is 114.5268976688385, memory usage is 12.013912200927734
data length is 38700000,count is 38700000, cost time is 53.31018781661987, memory usage is 12.017452239990234
data length is 38800000,count is 38800000, cost time is 65.94741868972778, memory usage is 12.058589935302734
data length is 38900000,count is 38900000, cost time is 42.62899565696716, memory usage is 12.067787170410156
data length is 39000000,count is 39000000, cost time is 57.95372486114502, memory usage is 11.979434967041016
data length is 39100000,count is 39100000, cost time is 62.12286162376404, memory usage is 12.026973724365234
data length is 39200000,count is 39200000, cost time is 80.76535606384277, memory usage is 12.111717224121094
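The effect of the int32 and category conversions can be checked in isolation. The sketch below uses made-up data and reuses the placeholder names somecount and somecate as hypothetical columns; it compares deep memory usage before and after astype:

```python
import pandas as pd

# Hypothetical columns standing in for the anonymized ones above.
df = pd.DataFrame({
    "somecount": list(range(1000)) * 100,  # stored as 64-bit integers by default
    "somecate": pd.Series(["red", "green", "blue", "red"] * 25_000, dtype=object),
})

before = df.memory_usage(deep=True).sum()

# int32 halves the integer column; category stores each label only once
# plus a small integer code per row.
small = df.astype({"somecount": "int32", "somecate": "category"})
after = small.memory_usage(deep=True).sum()

print("before:", before, "bytes")
print("after: ", after, "bytes")
```

The saving from category depends on how few distinct values the column has; a column of mostly unique strings would gain little.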
【Comments】:
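- This approach is known as using chunks. It would be better to rename dataset to chunk, since that is what it is. In any case, iteratively appending to a list is well known to be very slow and memory-hungry (it involves copying).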
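A common way to avoid the repeated copying is to keep each chunk as its own small DataFrame and call pd.concat a single time at the end. The sketch below is an illustration under assumptions, not the asker's code: load_in_chunks is a hypothetical helper, and a generator of dicts stands in for the MongoDB cursor.

```python
import pandas as pd


def load_in_chunks(records, chunk_size=100_000):
    """Build a DataFrame from an iterable of dicts, concatenating only once."""
    frames = []  # one small DataFrame per chunk
    chunk = []   # rows of the chunk currently being filled
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            frames.append(pd.DataFrame(chunk))
            chunk = []
    if chunk:  # leftover rows; skipped when the total is an exact multiple
        frames.append(pd.DataFrame(chunk))
    if not frames:
        return pd.DataFrame()
    # A single concat copies each chunk once, instead of re-copying the
    # growing result on every iteration.
    return pd.concat(frames, ignore_index=True)


# Stand-in for the MongoDB cursor: any iterable of dicts works.
fake_cursor = ({"a": i, "b": str(i)} for i in range(250))
data = load_in_chunks(fake_cursor, chunk_size=100)
print(len(data))
```

With the cursor from the question, the call would look like load_in_chunks(cursor). Peak memory is still roughly the chunks plus the final frame during the concat, so this mainly addresses the time per batch rather than the total footprint.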
Tags: python mongodb pandas memory