【发布时间】:2017-03-08 01:12:53
【问题描述】:
我对 Python 的功能有疑问。 我有一个非常大的数据集(200 GB),我将使用 python 遍历行,将数据存储在字典中,然后执行一些计算。最后,我将计算数据写入 CSV 文件。 我担心的是我的电脑的容量。我担心(或者很确定)我的 RAM 无法存储这么大的数据集。有没有更好的办法? 这是输入数据的结构:
#RIC Date[L] Time[L] Type ALP-L1-BidPrice ALP-L1-BidSize ALP-L1-AskPrice ALP-L1-AskSize ALP-L2-BidPrice ALP-L2-BidSize ALP-L2-AskPrice ALP-L2-AskSize ALP-L3-BidPrice ALP-L3-BidSize ALP-L3-AskPrice ALP-L3-AskSize ALP-L4-BidPrice ALP-L4-BidSize ALP-L4-AskPrice ALP-L4-AskSize ALP-L5-BidPrice ALP-L5-BidSize ALP-L5-AskPrice ALP-L5-AskSize TOR-L1-BidPrice TOR-L1-BidSize TOR-L1-AskPrice TOR-L1-AskSize TOR-L2-BidPrice TOR-L2-BidSize TOR-L2-AskPrice TOR-L2-AskSize TOR-L3-BidPrice TOR-L3-BidSize TOR-L3-AskPrice TOR-L3-AskSize TOR-L4-BidPrice TOR-L4-BidSize TOR-L4-AskPrice TOR-L4-AskSize TOR-L5-BidPrice TOR-L5-BidSize TOR-L5-AskPrice TOR-L5-AskSize
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 16000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 46000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 22000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 36000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 32000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 40000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 44000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 36000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:12.1 Market Depth 5.29 50000 5.3 32000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 46000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:12.1 Market Depth 5.29 50000 5.3 38000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 36000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
这是我尝试做的: 1. 读入ta数据并将它们存储到带有键[symbol][time][bid]和[ask]等的字典中 2. 在任何时间点,找到最佳买价和最佳卖价(这需要水平排序/在我不知道如何的键中的值中)因为买价和卖价来自不同的交易所,我们需要找到最优惠的价格,并将它们与该特定价格的数量一起从最好到最差进行排名。 3. 导出为 csv 文件。
这是我对代码的尝试。请帮我写得更有效率:
# this file calculate the depth up to $50,000
import csv
from math import ceil
from collections import defaultdict
# open csv file
csv_file = open('2016_01_04-data_3_stocks.csv', 'rU')
reader = csv.DictReader(csv_file)
# Set variables:
date = None
exchange_depth = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(float))))
effective_spread = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(float))))
time_bucket = [i * 100000.0 for i in range(0, 57600000000 / 100000)]
# Set functions
def time_to_milli(times):
hours = float(times.split(':')[0]) * 60 * 60 * 1000000
minutes = float(times.split(':')[1]) * 60 * 1000000
seconds = float(times.split(':')[2]) * 1000000
milliseconds = float(times.split('.')[1])
timestamp = hours + minutes + seconds + milliseconds
return timestamp
# Extract data
for i in reader:
if not bool(date):
date = i['Date[L]'][0:4] + "-" + i['Date[L]'][4:6] + "-" + i['Date[L]'][6:8]
security = i['#RIC'].split('.')[0]
exchange = i['#RIC'].split('.')[1]
timestamp = float(time_to_milli(i['Time[L]']))
bucket = ceil(float(time_to_milli(i['Time[L]'])) / 100000.0) * 100000.0
# input bid price and bid size
exchange_depth[security][bucket][Bid][i['ALP-L1-BidPrice']] += i['ALP-L1-BidSize']
exchange_depth[security][bucket][Bid][i['ALP-L2-BidPrice']] += i['ALP-L2-BidSize']
exchange_depth[security][bucket][Bid][i['ALP-L3-BidPrice']] += i['ALP-L3-BidSize']
exchange_depth[security][bucket][Bid][i['ALP-L4-BidPrice']] += i['ALP-L4-BidSize']
exchange_depth[security][bucket][Bid][i['ALP-L5-BidPrice']] += i['ALP-L5-BidSize']
exchange_depth[security][bucket][Bid][i['TOR-L1-BidPrice']] += i['TOR-L1-BidSize']
exchange_depth[security][bucket][Bid][i['TOR-L2-BidPrice']] += i['TOR-L2-BidSize']
exchange_depth[security][bucket][Bid][i['TOR-L3-BidPrice']] += i['TOR-L3-BidSize']
exchange_depth[security][bucket][Bid][i['TOR-L4-BidPrice']] += i['TOR-L4-BidSize']
exchange_depth[security][bucket][Bid][i['TOR-L5-BidPrice']] += i['TOR-L5-BidSize']
# input ask price and ask size
exchange_depth[security][bucket][Ask][i['ALP-L1-AskPrice']] += i['ALP-L1-AskSize']
exchange_depth[security][bucket][Ask][i['ALP-L2-AskPrice']] += i['ALP-L2-AskSize']
exchange_depth[security][bucket][Ask][i['ALP-L3-AskPrice']] += i['ALP-L3-AskSize']
exchange_depth[security][bucket][Ask][i['ALP-L4-AskPrice']] += i['ALP-L4-AskSize']
exchange_depth[security][bucket][Ask][i['ALP-L5-AskPrice']] += i['ALP-L5-AskSize']
exchange_depth[security][bucket][Ask][i['TOR-L1-AskPrice']] += i['TOR-L1-AskSize']
exchange_depth[security][bucket][Ask][i['TOR-L2-AskPrice']] += i['TOR-L2-AskSize']
exchange_depth[security][bucket][Ask][i['TOR-L3-AskPrice']] += i['TOR-L3-AskSize']
exchange_depth[security][bucket][Ask][i['TOR-L4-AskPrice']] += i['TOR-L4-AskSize']
exchange_depth[security][bucket][Ask][i['TOR-L5-AskPrice']] += i['TOR-L5-AskSize']
# Now rank bid price and ask price among exchange_depth[security][bucket][Bid] and exchange_depth[security][bucket][Ask] keys
#I don't know how to do this
【问题讨论】:
-
如果你逐行处理,并且你用来处理数据的字典没有超出你的RAM,你应该没有任何问题。
-
正如@FranciscoCouzo 所说,如果您迭代 行(没有将所有内容加载到内存中)并且字典相当小,那么您应该没问题。但是,如果您提供了一些示例数据(数据集的几行)和您尝试执行的计算类型,我们可能会给您一个更好的答案。
-
这似乎太宽泛或缺少minimal reproducible example
-
欢迎来到 StackOverflow。请阅读并遵循帮助文档中的发布指南。 Minimal, complete, verifiable example 适用于此。在您发布代码并准确描述问题之前,我们无法有效地帮助您。
-
谢谢大家,我会尽快上传代码。通常,我将逐行读取数据并将它们存储到字典中,我将在其中进行计算。据我了解,字典将存储在 RAM 中,对吗?
标签: python python-2.7 dictionary large-files large-data