Large dictionary in Python exceeds RAM capacity: answers

【Question title】: large dictionary in python exceed RAM capacity
【Posted】: 2017-03-08 01:12:53
【Question】:

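I have a question about what Python can handle. I have a very large dataset (200 GB). I will use Python to iterate over the lines, store the data in a dictionary, and then perform some calculations. Finally, I will write the computed data to a CSV file. My concern is my computer's capacity: I am afraid (or pretty sure) that my RAM cannot hold a dataset that large. Is there a better way? Here is the structure of the input data: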

#RIC    Date[L] Time[L] Type    ALP-L1-BidPrice ALP-L1-BidSize  ALP-L1-AskPrice ALP-L1-AskSize  ALP-L2-BidPrice ALP-L2-BidSize  ALP-L2-AskPrice ALP-L2-AskSize  ALP-L3-BidPrice ALP-L3-BidSize  ALP-L3-AskPrice ALP-L3-AskSize  ALP-L4-BidPrice ALP-L4-BidSize  ALP-L4-AskPrice ALP-L4-AskSize  ALP-L5-BidPrice ALP-L5-BidSize  ALP-L5-AskPrice ALP-L5-AskSize  TOR-L1-BidPrice TOR-L1-BidSize  TOR-L1-AskPrice TOR-L1-AskSize  TOR-L2-BidPrice TOR-L2-BidSize  TOR-L2-AskPrice TOR-L2-AskSize  TOR-L3-BidPrice TOR-L3-BidSize  TOR-L3-AskPrice TOR-L3-AskSize  TOR-L4-BidPrice TOR-L4-BidSize  TOR-L4-AskPrice TOR-L4-AskSize  TOR-L5-BidPrice TOR-L5-BidSize  TOR-L5-AskPrice TOR-L5-AskSize
HOU.ALP 20150901    30:10.8 Market Depth    5.29    50000   5.3 16000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 46000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:10.8 Market Depth    5.29    50000   5.3 22000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 36000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:10.8 Market Depth    5.29    50000   5.3 32000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 40000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:10.8 Market Depth    5.29    50000   5.3 44000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 36000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:12.1 Market Depth    5.29    50000   5.3 32000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 46000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:12.1 Market Depth    5.29    50000   5.3 38000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 36000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000

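Here is what I am trying to do:
1. Read in the data and store it in a dictionary with keys such as [symbol][time][bid] and [ask].
2. At any point in time, find the best bid and the best ask (this requires sorting across the values within a key, which I don't know how to do). Because the bids and asks come from different exchanges, we need to find the best prices and rank them from best to worst, together with the quantity available at each price.
3. Export the result to a CSV file.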

Here is my attempt at the code. Please help me make it more efficient:

# this file calculates the depth up to $50,000

import csv
from math import ceil
from collections import defaultdict

# open csv file
csv_file = open('2016_01_04-data_3_stocks.csv', 'rU')
reader = csv.DictReader(csv_file)

# Set variables:
date = None
exchange_depth = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(float))))
effective_spread = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(float))))
time_bucket = [i * 100000.0 for i in range(0, 57600000000 // 100000)]

# Set functions
def time_to_milli(times):
    # Convert an 'HH:MM:SS.ffffff' string to microseconds since midnight.
    # Despite the name, the unit is microseconds (matching the 100000.0 bucket width).
    hours, minutes, seconds = times.split(':')
    # float(seconds) already includes the fractional part, so it is not added again
    timestamp = (float(hours) * 60 * 60 + float(minutes) * 60 + float(seconds)) * 1000000
    return timestamp


# Extract data
for i in reader:
    if not bool(date):
        date = i['Date[L]'][0:4] + "-" + i['Date[L]'][4:6] + "-" + i['Date[L]'][6:8]
    security = i['#RIC'].split('.')[0]
    exchange = i['#RIC'].split('.')[1]
    timestamp = float(time_to_milli(i['Time[L]']))
    bucket = ceil(float(time_to_milli(i['Time[L]'])) / 100000.0) * 100000.0
    # input bid/ask price and size for every exchange and level
    for venue in ('ALP', 'TOR'):
        for level in range(1, 6):
            for side in ('Bid', 'Ask'):
                price = i['%s-L%d-%sPrice' % (venue, level, side)]
                size = i['%s-L%d-%sSize' % (venue, level, side)]
                if price and size:  # skip empty cells
                    # csv yields strings, so convert before accumulating
                    exchange_depth[security][bucket][side][float(price)] += float(size)
# Now rank bid price and ask price among exchange_depth[security][bucket][Bid] and exchange_depth[security][bucket][Ask] keys
    #I don't know how to do this

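【Comments】:

If you process the file line by line, and the dictionary you use to process the data does not outgrow your RAM, you shouldn't have any problem.
As @FranciscoCouzo said, if you iterate over the lines (without loading everything into memory) and the dictionary stays reasonably small, you should be fine. However, if you provide some sample data (a few lines of the dataset) and the kind of calculation you are trying to perform, we may be able to give you a better answer.
This seems too broad or lacks a minimal reproducible example.
Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation. Minimal, complete, verifiable example applies here. We cannot effectively help you until you post your code and accurately describe the problem.
Thank you all, I will upload the code as soon as possible. In general, I will read the data line by line and store it in a dictionary, where I will do the calculations. As I understand it, the dictionary will be stored in RAM, right?

Tags: python python-2.7 dictionary large-files large-data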
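
To illustrate the point made in these comments: a minimal sketch of keeping only one time bucket in memory at a time. It assumes the rows arrive sorted by symbol and then by time, which the question does not state. The column names (`symbol`, `time_us`, `price`, `size`), the helper `stream_buckets`, and the bucket width are hypothetical stand-ins for the real file layout, and the code is written for Python 3 even though the question is tagged python-2.7.

```python
import csv
import io
from collections import defaultdict
from itertools import groupby


def stream_buckets(rows, width=100000):
    """Yield (symbol, bucket, depth) one time bucket at a time.

    Assumes rows are sorted by symbol and then by time, so every
    (symbol, bucket) group is contiguous and can be dropped after use.
    """
    def key(row):
        return row['symbol'], int(row['time_us']) // width

    for (symbol, bucket), group in groupby(rows, key=key):
        depth = defaultdict(float)  # lives only for the current bucket
        for row in group:
            depth[float(row['price'])] += float(row['size'])
        yield symbol, bucket, dict(depth)


# Tiny in-memory stand-in for the 200 GB file (hypothetical columns).
sample = io.StringIO(
    "symbol,time_us,price,size\n"
    "HOU,10,5.29,50000\n"
    "HOU,20,5.29,16000\n"
    "HOU,100010,5.28,50000\n"
)
results = list(stream_buckets(csv.DictReader(sample)))
```

With the real file, each yielded bucket would be written to the output CSV immediately instead of being collected in a list, so memory use is bounded by the largest single bucket rather than by the size of the file.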

【Solution 1】:

Based on what you have told us, you can do something like the following:

import csv
with open("path/to/my_dataset", 'r') as input_f, open("output.csv", 'a') as output_f:
    # Create the writer once instead of once per row
    writer = csv.writer(output_f)
    # Keep reading lines from input data until you run out
    for line in input_f:
        # do processing and add to processed
        processed = []

        # write processed data to output file
        writer.writerow(processed)
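
The "do processing" step is left open above. For the ranking the question asks about (step 2), one possible sketch is shown below; it is not part of the original answer. The helper `rank_book` and the `(side, price, size)` tuple layout are assumptions made for illustration, and the quotes are the first two levels of the first sample row in the question.

```python
from collections import defaultdict


def rank_book(quotes):
    """Merge (side, price, size) quotes from several exchanges and rank them.

    Bids are ranked highest price first and asks lowest price first;
    sizes quoted at the same price on different exchanges are summed.
    """
    bids, asks = defaultdict(float), defaultdict(float)
    for side, price, size in quotes:
        book = bids if side == 'Bid' else asks
        book[price] += size
    ranked_bids = sorted(bids.items(), reverse=True)  # best bid = highest price
    ranked_asks = sorted(asks.items())                # best ask = lowest price
    return ranked_bids, ranked_asks


quotes = [
    ('Bid', 5.29, 50000), ('Ask', 5.30, 16000),  # ALP level 1
    ('Bid', 5.28, 50000), ('Ask', 5.31, 50000),  # ALP level 2
    ('Bid', 5.29, 50000), ('Ask', 5.30, 46000),  # TOR level 1
    ('Bid', 5.28, 50000), ('Ask', 5.31, 50000),  # TOR level 2
]
bids, asks = rank_book(quotes)
best_bid, best_ask = bids[0], asks[0]
```

Because only one row (or one time bucket) is ranked at a time, the sort works on a handful of price levels and never needs the whole dataset in memory.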

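【Comments】:

I'm a bit confused why you don't just use for line_num, line in enumerate(f, 1): instead of the two readline calls inside and outside the while loop plus manual line counting. Also, why not open the output file just once, wrap it in csv.writer once, and write each row as you finish it, saving the cost of constantly opening and closing the file? Batching the rows is pointless; the file object buffers for you anyway.
@ShadowRanger You're right, I just threw this together quickly. Those are all good simplifications.
@ShadowRanger Batching writes could be useful if you only write to the output file once every x input lines.
That's why I mentioned buffering. Even if you "write" every line, plain file objects buffer: you might call write 100 times, but a real write to the underlying file only happens each time the buffer fills up. If the average write is 100 bytes and the buffer is 8 KB (Python's default buffer size), those 100 calls to write result in only two system calls that write to the backing file. If that is still too frequent, increase the buffer size (it is an optional parameter to open); manual batching offers almost no advantage.
I have edited the post to include the input format and my attempt at the code.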
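
The buffering behaviour described in the comments above can be observed with a small experiment. This is only a sketch: `CountingRaw` is a hypothetical stand-in for the operating-system file that counts how often the buffer actually hands data down, and the exact count may vary between Python implementations.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical raw stream that counts the real (unbuffered) writes."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.total = 0

    def writable(self):
        return True

    def write(self, data):
        # Called only when the buffer above decides to pass data down
        self.calls += 1
        self.total += len(data)
        return len(data)


raw = CountingRaw()
buffered = io.BufferedWriter(raw, buffer_size=8192)  # 8 KB, as in the comment
for _ in range(100):
    buffered.write(b"x" * 100)  # 100 logical writes of 100 bytes each
buffered.flush()

print(raw.calls, raw.total)
```

All 10,000 bytes reach the raw stream, but in far fewer than 100 calls. To request a larger buffer for a real file, pass the `buffering` argument, for example `open("output.csv", "a", buffering=1 << 20)`.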