【问题标题】:efficiently growing a dataframe in pandas在 pandas 中有效地增长数据框
【发布时间】:2018-10-25 15:46:53
【问题描述】:

在迭代的基础上,我正在生成一个如下所示的 DataFrame:

              RIC RICRoot ISIN ExpirationDate                      Exchange           ...            OpenInterest  BlockVolume  TotalVolume2  SecurityDescription  SecurityLongDescription
closingDate                                                                           ...                                                                                                 
2018-03-15   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None
2018-03-16   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None
2018-03-19   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None
2018-03-20   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None
2018-03-21   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None

我把它变成了一个多索引的 DF:

tmp.columns = pd.MultiIndex.from_arrays( [ [contract]*len(tmp.columns), tmp.columns.tolist() ] )

其中contract 只是该数据的引用名称,您可以在下面的输出中看到SPH0

    SPH0                                                                     ...                                                                                            
              RIC RICRoot ISIN ExpirationDate                      Exchange           ...           OpenInterest BlockVolume TotalVolume2 SecurityDescription SecurityLongDescription
closingDate                                                                           ...                                                                                            
2018-03-15   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None
2018-03-16   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None
2018-03-19   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None
2018-03-20   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None
2018-03-21   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None

我目前有一种非常低效的方式来合并这些 DataFrame:

if df is None:
            df = tmp;
        else:
            df = df.merge( tmp, how='outer', left_index=True, right_index=True)

这非常慢。我想将所有这些 tempdf 与它们各自的合同名称一起存储在关联的映射样式中,并且能够以矢量化的方式轻松引用它们的数据。最佳解决方案是什么?水平/垂直增长重要吗?

【问题讨论】:

  • 你为什么不直接使用set_index()
  • 何时合并?我不确定如何使用 set_index() 将 DataFrame 对象彼此附加。
  • 请使用minimal reproducible example 发布完整的代码块,我们可以在空的 Python 环境中运行。最后一段是否在 for 循环内运行?
  • 您能否在帖子中也包含您想要的输出?

标签: python pandas numpy dataframe


【解决方案1】:

IIUC,您可以使用 pd.concat() 并传递您的数据框列表和生成的 MultiIndex 数据框的键。采取以下数据框样本:

import pandas as pd

df1 = pd.DataFrame([                                                                                            
['2018-03-11',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-12',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-15',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-23',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-24',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market']],
columns=['closingDate', 'RIC', 'RICRoot', 'ExpirationDate', 'Exchange'])

df2 = pd.DataFrame([                                                                                            
['2018-03-15',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-16',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-22',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-24',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-20',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market']],
columns=['closingDate', 'RIC', 'RICRoot', 'ExpirationDate', 'Exchange'])

df3 = pd.DataFrame([                                                                                            
['2018-03-15',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-16',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-18',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-20',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-21',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market']],
columns=['closingDate', 'RIC', 'RICRoot', 'ExpirationDate', 'Exchange'])

现在拨打pd.concat():

pd.concat([df1, df2, df3], keys=['SPH0','HAB3','UHA6'])

产量:

       closingDate              ...                                   Exchange
SPH0 0  2018-03-11              ...               CME:Index and Options Market
     1  2018-03-12              ...               CME:Index and Options Market
     2  2018-03-15              ...               CME:Index and Options Market
     3  2018-03-23              ...               CME:Index and Options Market
     4  2018-03-24              ...               CME:Index and Options Market
HAB3 0  2018-03-15              ...               CME:Index and Options Market
     1  2018-03-16              ...               CME:Index and Options Market
     2  2018-03-22              ...               CME:Index and Options Market
     3  2018-03-24              ...               CME:Index and Options Market
     4  2018-03-20              ...               CME:Index and Options Market
UHA6 0  2018-03-15              ...               CME:Index and Options Market
     1  2018-03-16              ...               CME:Index and Options Market
     2  2018-03-18              ...               CME:Index and Options Market
     3  2018-03-20              ...               CME:Index and Options Market
     4  2018-03-21              ...               CME:Index and Options Market

您还可以使用列表推导来创建要传递给pd.concat() 的数据框列表,例如:

my_keys = ['SPH0','HAB3','UHA6']
dfs = [create_df(key) for key in my_keys]
pd.concat(dfs, keys=my_keys)

函数create_df()返回一个数据帧。

【讨论】:

    猜你喜欢
    • 2019-04-21
    • 2021-06-18
    • 1970-01-01
    • 2021-03-05
    • 2017-08-31
    • 2020-09-27
    • 1970-01-01
    • 2021-05-14
    • 2013-10-13
    相关资源
    最近更新 更多