【Posted】: 2018-02-10 03:15:38
【Question】:
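I am working with a system that currently runs on large (>5GB) .csv files. To improve performance, I am testing (A) different ways of creating dataframes from disk (pandas vs. dask) and (B) different ways of storing results to disk (.csv vs. hdf5 files).
To benchmark performance, I did the following: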
import gc

import dask.dataframe as dd
import pandas as pd

# results_path and hdf are not defined in this snippet; presumably they are
# the path to the .csv file and an open handle to store.h5.

def dask_read_from_hdf():
    results_dd_hdf = dd.read_hdf('store.h5', key='period1', columns=['Security'])
    analyzed_stocks_dd_hdf = results_dd_hdf.Security.unique()
    hdf.close()

def pandas_read_from_hdf():
    results_pd_hdf = pd.read_hdf('store.h5', key='period1', columns=['Security'])
    analyzed_stocks_pd_hdf = results_pd_hdf.Security.unique()
    hdf.close()

def dask_read_from_csv():
    results_dd_csv = dd.read_csv(results_path, sep=",", usecols=[0], header=1, names=["Security"])
    analyzed_stocks_dd_csv = results_dd_csv.Security.unique()

def pandas_read_from_csv():
    results_pd_csv = pd.read_csv(results_path, sep=",", usecols=[0], header=1, names=["Security"])
    analyzed_stocks_pd_csv = results_pd_csv.Security.unique()

# %timeit is an IPython magic, so this part runs in IPython/Jupyter only.
print("dask hdf performance")
%timeit dask_read_from_hdf()
gc.collect()
print("")
print("pandas hdf performance")
%timeit pandas_read_from_hdf()
gc.collect()
print("")
print("dask csv performance")
%timeit dask_read_from_csv()
gc.collect()
print("")
print("pandas csv performance")
%timeit pandas_read_from_csv()
gc.collect()
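For anyone who wants to repeat this kind of measurement outside IPython, where `%timeit` is not available, here is a minimal sketch built on the standard-library `timeit` module. The helper name `best_of` and the `sample_workload` function are hypothetical stand-ins and not part of the original benchmark; any of the four read functions could be passed in instead.

```python
import gc
import timeit


def best_of(func, repeat=3, number=1):
    """Return the best per-call time in seconds, in the spirit of %timeit."""
    gc.collect()  # clean up before measuring, mirroring the gc.collect() calls above
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


def sample_workload():
    # Hypothetical placeholder for a read function such as pandas_read_from_csv
    return sorted(set(i % 97 for i in range(100000)))


if __name__ == "__main__":
    seconds = best_of(sample_workload)
    print("sample workload: %.3f ms per loop" % (seconds * 1e3))
```

Taking the minimum over a few repeats follows the "best of 3" convention that `%timeit` reports in the results.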
My findings:
dask hdf performance
10 loops, best of 3: 133 ms per loop
pandas hdf performance
1 loop, best of 3: 1.42 s per loop
dask csv performance
1 loop, best of 3: 7.88 ms per loop
pandas csv performance
1 loop, best of 3: 827 ms per loop
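Given that hdf5 storage is faster to access than .csv, and that dask creates dataframes faster than pandas, why is dask with hdf5 slower than dask with csv? Am I doing something wrong?
When does it make sense, performance-wise, to create a dask dataframe from an HDF5 store object?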
【Comments】: