Efficiently growing a DataFrame in pandas

[Question title]: Efficiently growing a DataFrame in pandas
[Posted]: 2018-10-25 15:46:53
[Question description]:

On each iteration, I generate a DataFrame that looks like this:

              RIC RICRoot ISIN ExpirationDate                      Exchange           ...            OpenInterest  BlockVolume  TotalVolume2  SecurityDescription  SecurityLongDescription
closingDate                                                                           ...                                                                                                 
2018-03-15   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None
2018-03-16   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None
2018-03-19   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None
2018-03-20   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None
2018-03-21   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None

I turn it into a MultiIndex DataFrame:

tmp.columns = pd.MultiIndex.from_arrays( [ [contract]*len(tmp.columns), tmp.columns.tolist() ] )

where contract is just a reference name for that data; you can see SPH0 in the output below:

    SPH0                                                                     ...                                                                                            
              RIC RICRoot ISIN ExpirationDate                      Exchange           ...           OpenInterest BlockVolume TotalVolume2 SecurityDescription SecurityLongDescription
closingDate                                                                           ...                                                                                            
2018-03-15   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None
2018-03-16   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None
2018-03-19   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None
2018-03-20   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None
2018-03-21   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None

I currently have a very inefficient way of merging these DataFrames:

if df is None:
    df = tmp
else:
    df = df.merge(tmp, how='outer', left_index=True, right_index=True)

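This is very slow. I would like to store all of these temporary DataFrames, together with their respective contract names, in an associative, map-like structure, and be able to reference their data easily in a vectorized way. What is the best solution? Does it matter whether the frame grows horizontally or vertically?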

[Question comments]:

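Why don't you just use set_index()?
When merging? I'm not sure how to use set_index() to append DataFrame objects to one another.
Please post a complete code block, as a minimal reproducible example, that we can run in an empty Python environment. Does the last snippet run inside a for loop?
Could you also include your desired output in the post?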

Tags: python pandas numpy dataframe

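[Solution 1]:

IIUC, you can use pd.concat(), passing it a list of your DataFrames along with the keys for the resulting MultiIndex DataFrame. Take the following sample DataFrames: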

import pandas as pd

df1 = pd.DataFrame([                                                                                            
['2018-03-11',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-12',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-15',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-23',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-24',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market']],
columns=['closingDate', 'RIC', 'RICRoot', 'ExpirationDate', 'Exchange'])

df2 = pd.DataFrame([                                                                                            
['2018-03-15',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-16',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-22',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-24',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-20',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market']],
columns=['closingDate', 'RIC', 'RICRoot', 'ExpirationDate', 'Exchange'])

df3 = pd.DataFrame([                                                                                            
['2018-03-15',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-16',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-18',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-20',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-21',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market']],
columns=['closingDate', 'RIC', 'RICRoot', 'ExpirationDate', 'Exchange'])

Now call pd.concat():

pd.concat([df1, df2, df3], keys=['SPH0','HAB3','UHA6'])

Yields:

       closingDate              ...                                   Exchange
SPH0 0  2018-03-11              ...               CME:Index and Options Market
     1  2018-03-12              ...               CME:Index and Options Market
     2  2018-03-15              ...               CME:Index and Options Market
     3  2018-03-23              ...               CME:Index and Options Market
     4  2018-03-24              ...               CME:Index and Options Market
HAB3 0  2018-03-15              ...               CME:Index and Options Market
     1  2018-03-16              ...               CME:Index and Options Market
     2  2018-03-22              ...               CME:Index and Options Market
     3  2018-03-24              ...               CME:Index and Options Market
     4  2018-03-20              ...               CME:Index and Options Market
UHA6 0  2018-03-15              ...               CME:Index and Options Market
     1  2018-03-16              ...               CME:Index and Options Market
     2  2018-03-18              ...               CME:Index and Options Market
     3  2018-03-20              ...               CME:Index and Options Market
     4  2018-03-21              ...               CME:Index and Options Market
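
Since the question asks about referencing each contract's data easily, here is a minimal sketch of how the keyed result might be queried. The two small frames and their values are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Small illustrative frames (made-up values, not the asker's data).
df_a = pd.DataFrame({"closingDate": ["2018-03-15", "2018-03-16"],
                     "RIC": ["SPH0", "SPH0"]})
df_b = pd.DataFrame({"closingDate": ["2018-03-15", "2018-03-19"],
                     "RIC": ["HAB3", "HAB3"]})

# keys= adds an outer index level holding the contract name.
combined = pd.concat([df_a, df_b], keys=["SPH0", "HAB3"])

# All rows for one contract: select on the outer index level.
sph0 = combined.loc["SPH0"]

# One column across every contract, still labeled by contract.
dates = combined["closingDate"]

# Cross-section on the inner level: row 0 of every contract.
first_rows = combined.xs(0, level=1)
```

Here .loc on the outer level plays the role of the "map lookup by contract name", while column selection and xs() operate across all contracts at once.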

You can also use a list comprehension to build the list of DataFrames to pass to pd.concat(), for example:

my_keys = ['SPH0','HAB3','UHA6']
dfs = [create_df(key) for key in my_keys]
pd.concat(dfs, keys=my_keys)

Here the function create_df() returns a DataFrame.
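
If the layout from the question is wanted (contract names as the outer column level, dates as the row index), one possible sketch is to collect the frames in a dict inside the loop and concatenate once at the end with axis=1. The create_df() body below is a hypothetical stand-in with made-up values; only the dict-then-concat pattern is the point.

```python
import pandas as pd

def create_df(key):
    # Hypothetical stand-in for whatever produces each per-contract frame.
    dates = pd.to_datetime(["2018-03-15", "2018-03-16"])
    return pd.DataFrame(
        {"RIC": [key] * 2, "RICRoot": [key[:2]] * 2},
        index=pd.Index(dates, name="closingDate"),
    )

my_keys = ["SPH0", "HAB3", "UHA6"]

# Collect frames in a dict; nothing is merged or copied inside the loop.
frames = {}
for key in my_keys:
    frames[key] = create_df(key)

# One concat at the end; the dict keys become the outer column level
# and rows are aligned on the closingDate index (outer join by default).
wide = pd.concat(frames, axis=1)

# Map-style access to a single contract.
sph0 = wide["SPH0"]

# The same field for every contract at once.
rics = wide.xs("RIC", axis=1, level=1)
```

Calling merge on every iteration rebuilds the accumulated frame each time, whereas this pattern builds the combined frame only once, which is typically where the speedup comes from.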
