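【Question title】: How to access prior rows within a MultiIndex pandas DataFrame
【Posted】: 2017-01-28 19:01:09
【Question】:

How can I access rows in a multi-level DataFrame indexed by Datetime? For example, this is downloaded financial data. The hard part is getting into the frame and accessing non-adjacent rows of a specific inner level without explicitly specifying the outer-level date, because I have thousands of such rows.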

                                       ABC        DEF        GHI  \  
Date                STATS                                            
2012-07-19 00:00:00                    NaN         NaN         NaN   
                    investment        4             9          13        
                    price             5             8          1  
                    quantity          12            9          8   

So the two formulas I am looking for can be summarized as

X(today row) = quantity(prior row)*price(prior row) 
or                           
X(today row) = quantity(prior row)*price(today)

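The difficulty is how to formulate access to these rows with numpy or pandas for a multi-level index, given that the rows are not adjacent.

In the end I would get this: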

                                         ABC        DEF        GHI    XN
Date                STATS                                            
2012-07-19 00:00:00                    NaN         NaN         NaN   
                    investment          4            9          13    X1
                    price               5            8           1   
                    quantity            12           9           8    

2012-07-18 00:00:00                    NaN         NaN         NaN   
                    investment          1             2          3    X2
                    price               2             3          4   
                    quantity           18             6          7    

X1= (18*2)+(6*3)+(7*4) (quantity_day_2 *price_day_2 data) 
or for the other formula
X1= (18*5)+(6*8)+(7*1) (quantity_day_2 *price_day_1 data)
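
As a quick check of the arithmetic in the two formulas above, here is a minimal numpy sketch. The variable names are illustrative and not from the original post; the values are copied from the sample rows for 2012-07-18 (prior day) and 2012-07-19 (today).

```python
import numpy as np

# Values copied from the sample table (columns ABC, DEF, GHI).
quantity_prior = np.array([18, 6, 7])   # quantity on 2012-07-18
price_prior = np.array([2, 3, 4])       # price on 2012-07-18
price_today = np.array([5, 8, 1])       # price on 2012-07-19

# Formula 1: quantity(prior row) * price(prior row), summed over columns.
x1_prior_price = int(quantity_prior @ price_prior)

# Formula 2: quantity(prior row) * price(today), summed over columns.
x1_today_price = int(quantity_prior @ price_today)

print(x1_prior_price, x1_today_price)
```

Each result is a dot product across the ABC, DEF and GHI columns, which is why the answers further down multiply two frames and then call `sum(axis=1)`.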

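Can I use groupby?

【Comments】:

  • Can you modify the DataFrame to use small integer values (for easier verification) and add the desired output from the data sample? Thanks.
  • Oh yes, I can do that.
  • Please let me know with a comment when the question has been changed. Thanks.
  • @jezrael Ok. Thanks
  • One question: what is the desired output? 2 new DataFrames?

Tags: python pandas indexing dataframe multi-index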
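
On the groupby question, one possible approach (a sketch, not taken from the answers below) is to group on the inner `STATS` level and shift within each group, so that every row sees the value from the previous date without naming any date explicitly. The frame construction is an assumption reconstructed from the sample data, and the rows with an empty `STATS` label are left out for brevity.

```python
import pandas as pd

# Rebuild the sample data (assumption: rows with an empty STATS label omitted).
dates = pd.to_datetime(['2012-07-19', '2012-07-18', '2012-07-17'])
index = pd.MultiIndex.from_product(
    [dates, ['investment', 'price', 'quantity']], names=['Date', 'STATS'])
data = [[4, 9, 13], [5, 8, 1], [12, 9, 8],
        [1, 2, 3], [2, 3, 4], [18, 6, 7],
        [1, 2, 3], [0, 1, 4], [5, 1, 0]]
df = pd.DataFrame(data, index=index, columns=['ABC', 'DEF', 'GHI'],
                  dtype=float).sort_index()

# Within each STATS label, shift(1) pulls the values from the previous date
# (the frame is sorted by ascending date).
prior = df.groupby(level='STATS').shift(1)

# quantity(prior row) * price(today), summed across the columns.
prior_quantity = prior.xs('quantity', level='STATS')
price_today = df.xs('price', level='STATS')
x = (prior_quantity * price_today).sum(axis=1, min_count=1)
print(x)
```

`min_count=1` keeps the first date as NaN instead of 0.0, since it has no prior row to draw from.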


【Solution 1】:

If you need to add the output to the original DataFrame, it is more complicated:

print (df)
                        ABC  DEF   GHI
Date       STATS                      
2012-07-19              NaN  NaN   NaN
           investment   4.0  9.0  13.0
           price        5.0  8.0   1.0
           quantity    12.0  9.0   8.0
2012-07-18              NaN  NaN   NaN
           investment   1.0  2.0   3.0
           price        2.0  3.0   4.0
           quantity    18.0  6.0   7.0
2012-07-17              NaN  NaN   NaN
           investment   1.0  2.0   3.0
           price        0.0  1.0   4.0
           quantity     5.0  1.0   0.0
df.sort_index(inplace=True)

#rename the STATS label to 'investment' so the result aligns with the investment rows in the final concat
idx = pd.IndexSlice
p = df.loc[idx[:,'price'],:].rename(index={'price':'investment'})
q = df.loc[idx[:,'quantity'],:].rename(index={'quantity':'investment'})
print (p)
                       ABC  DEF  GHI
Date       STATS                    
2012-07-17 investment  0.0  1.0  4.0
2012-07-18 investment  2.0  3.0  4.0
2012-07-19 investment  5.0  8.0  1.0

print (q)
                        ABC  DEF  GHI
Date       STATS                     
2012-07-17 investment   5.0  1.0  0.0
2012-07-18 investment  18.0  6.0  7.0
2012-07-19 investment  12.0  9.0  8.0

#multiply and concat to original df
print (p * q)
                        ABC   DEF   GHI
Date       STATS                       
2012-07-17 investment   0.0   1.0   0.0
2012-07-18 investment  36.0  18.0  28.0
2012-07-19 investment  60.0  72.0   8.0
a = (p * q).sum(axis=1).rename('col1')
print (pd.concat([df, a], axis=1))
                        ABC  DEF   GHI   col1
Date       STATS                             
2012-07-17              NaN  NaN   NaN    NaN
           investment   1.0  2.0   3.0    1.0
           price        0.0  1.0   4.0    NaN
           quantity     5.0  1.0   0.0    NaN
2012-07-18              NaN  NaN   NaN    NaN
           investment   1.0  2.0   3.0   82.0
           price        2.0  3.0   4.0    NaN
           quantity    18.0  6.0   7.0    NaN
2012-07-19              NaN  NaN   NaN    NaN
           investment   4.0  9.0  13.0  140.0
           price        5.0  8.0   1.0    NaN
           quantity    12.0  9.0   8.0    NaN
#shift with MultiIndex - not supported yet - first create a DatetimeIndex with unstack,
#then shift and finally reshape back to the original with stack

#multiply and concat to original df
print (p.unstack().shift(-1, freq='D').stack() * q)
                        ABC   DEF  GHI
Date       STATS                      
2012-07-16 investment   NaN   NaN  NaN
2012-07-17 investment  10.0   3.0  0.0
2012-07-18 investment  90.0  48.0  7.0
2012-07-19 investment   NaN   NaN  NaN

b = (p.unstack().shift(-1, freq='D').stack() * q).sum(axis=1).rename('col2')
print (pd.concat([df, b], axis=1))
                        ABC  DEF   GHI   col2
Date       STATS                             
2012-07-16 investment   NaN  NaN   NaN    0.0
2012-07-17              NaN  NaN   NaN    NaN
           investment   1.0  2.0   3.0   13.0
           price        0.0  1.0   4.0    NaN
           quantity     5.0  1.0   0.0    NaN
2012-07-18              NaN  NaN   NaN    NaN
           investment   1.0  2.0   3.0  145.0
           price        2.0  3.0   4.0    NaN
           quantity    18.0  6.0   7.0    NaN
2012-07-19              NaN  NaN   NaN    NaN
           investment   4.0  9.0  13.0    0.0
           price        5.0  8.0   1.0    NaN
           quantity    12.0  9.0   8.0    NaN
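
In the output above, 145.0 sits on the 2012-07-18 investment row and an extra 2012-07-16 row appears, whereas the question places X1 on the 2012-07-19 row. A hedged alternative sketch follows: it shifts the prior day's values forward by position so each result lands on "today's" investment row. The column names `X_prior` and `X_today` are hypothetical, the frame construction is an assumption reconstructed from the printed sample, and the rows with an empty `STATS` label are omitted.

```python
import pandas as pd

# Rebuild the sample data (assumption: rows with an empty STATS label omitted).
dates = pd.to_datetime(['2012-07-19', '2012-07-18', '2012-07-17'])
index = pd.MultiIndex.from_product(
    [dates, ['investment', 'price', 'quantity']], names=['Date', 'STATS'])
data = [[4, 9, 13], [5, 8, 1], [12, 9, 8],
        [1, 2, 3], [2, 3, 4], [18, 6, 7],
        [1, 2, 3], [0, 1, 4], [5, 1, 0]]
df = pd.DataFrame(data, index=index, columns=['ABC', 'DEF', 'GHI'],
                  dtype=float).sort_index()

# One row per date for each statistic.
price = df.xs('price', level='STATS')
quantity = df.xs('quantity', level='STATS')

# Dates are ascending, so shift(1) puts the prior date's values on each row.
prior_price = price.shift(1)
prior_quantity = quantity.shift(1)

# min_count=1 keeps NaN (not 0.0) where there is no prior row.
x_prior = (prior_quantity * prior_price).sum(axis=1, min_count=1)
x_today = (prior_quantity * price).sum(axis=1, min_count=1)

# Write the results onto the investment rows only; other rows stay NaN.
out = df.copy()
out['X_prior'] = float('nan')
out['X_today'] = float('nan')
is_investment = out.index.get_level_values('STATS') == 'investment'
out.loc[is_investment, 'X_prior'] = x_prior.to_numpy()
out.loc[is_investment, 'X_today'] = x_today.to_numpy()
print(out)
```

Because the shift is positional rather than calendar-based, no new dates are introduced into the index, and the inner `STATS` level of the original frame is preserved.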

【Comments】:

  • Please check my solution - the output is 2 DataFrames, one for the 1st and one for the 2nd condition.
  • Do you use (pd.concat([df, b], axis=1)).to_csv()?
【Solution 2】:

You can use:

#add new datetime with data for better testing
print (df)
                        ABC  DEF   GHI
Date       STATS                      
2012-07-19              NaN  NaN   NaN
           investment   4.0  9.0  13.0
           price        5.0  8.0   1.0
           quantity    12.0  9.0   8.0
2012-07-18              NaN  NaN   NaN
           investment   1.0  2.0   3.0
           price        2.0  3.0   4.0
           quantity    18.0  6.0   7.0
2012-07-17              NaN  NaN   NaN
           investment   1.0  2.0   3.0
           price        0.0  1.0   4.0
           quantity     5.0  1.0   0.0
#lexsort the MultiIndex
df.sort_index(inplace=True)

#select data and drop the last level, because:
#1. it is needed for shift
#2. it is easier to work with
idx = pd.IndexSlice
p = df.loc[idx[:,'price'],:]
p.index = p.index.droplevel(-1)
q = df.loc[idx[:,'quantity'],:]
q.index = q.index.droplevel(-1)
print (p)
            ABC  DEF  GHI
Date                     
2012-07-17  0.0  1.0  4.0
2012-07-18  2.0  3.0  4.0
2012-07-19  5.0  8.0  1.0

print (q)
             ABC  DEF  GHI
Date                      
2012-07-17   5.0  1.0  0.0
2012-07-18  18.0  6.0  7.0
2012-07-19  12.0  9.0  8.0
print (p * q)
             ABC   DEF   GHI
Date                        
2012-07-17   0.0   1.0   0.0
2012-07-18  36.0  18.0  28.0
2012-07-19  60.0  72.0   8.0

print ((p * q).sum(axis=1).to_frame().rename(columns={0:'col1'}))
             col1
Date             
2012-07-17    1.0
2012-07-18   82.0
2012-07-19  140.0
#shift rows by -1 day, because the df is lexsorted (ascending dates)
print (p.shift(-1, freq='D') * q)
             ABC   DEF  GHI
Date                       
2012-07-16   NaN   NaN  NaN
2012-07-17  10.0   3.0  0.0
2012-07-18  90.0  48.0  7.0
2012-07-19   NaN   NaN  NaN

print ((p.shift(-1, freq='D') * q).sum(axis=1).to_frame().rename(columns={0:'col2'}))
             col2
Date             
2012-07-16    0.0
2012-07-17   13.0
2012-07-18  145.0
2012-07-19    0.0
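
One caveat worth sketching: `shift(-1, freq='D')` moves the index by one calendar day, which matches "the neighbouring row" only when the dates are consecutive. Financial data usually skips weekends, so a positional `shift(-1)` may be closer to what is wanted. The example below uses made-up prices and quantities (not from the question) with a weekend gap to show the difference.

```python
import pandas as pd

# Hypothetical data with a weekend gap: Thursday, Friday, Monday.
dates = pd.to_datetime(['2012-07-19', '2012-07-20', '2012-07-23'])
p = pd.DataFrame({'ABC': [5.0, 2.0, 3.0]}, index=dates)    # prices
q = pd.DataFrame({'ABC': [12.0, 18.0, 4.0]}, index=dates)  # quantities

# Calendar shift: Monday's price moves to Sunday, which has no quantity row.
calendar = (p.shift(-1, freq='D') * q).sum(axis=1, min_count=1)

# Positional shift: Monday's price moves to the previous row (Friday).
positional = (p.shift(-1) * q).sum(axis=1, min_count=1)

print(calendar)
print(positional)
```

With the calendar shift, Friday gets no value because its next calendar day is a Saturday with no data. The positional shift pairs Friday's quantity with Monday's price. Which behaviour is correct depends on how "prior row" is meant for the data at hand.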

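【Comments】:

  • Thank you for all your hard work! I will go through it and comment afterwards. I suppose I can always copy col2 into my original pd.Series(XN) afterwards, right? Because I actually need to keep the inner level.
  • ;) I was asking about the desired output - do you need a new column in the DataFrame? If yes, what is the index of this new column? datetime, investment?
  • Ohhh. Yes, I need to add a new column to the DataFrame, and the new column will have its values on the Investment index. Thanks! :)
  • I created another solution, because it is completely different.
  • My email is in my profile ;) I don't know if I will have time, but you can email me.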