【Question title】: PANDAS: merging calculated data in groupby dataframe into main dataframe
【Posted】: 2015-10-11 22:53:16
【Question】:

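First time posting here, so apologies if I haven't framed the question quite right. I have spent years working with data in Excel and PowerPivot, but my current project needs something more heavy-duty. I have been looking at Pandas and think it can do the job, but I'm stuck.

I am trying to calculate the number of days between purchases for each customer.

My initial dataframe looks like this: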

    customer_id date        invoice_amt 
0   101A        21/03/2012  654.76      
1   101A        1/02/2012   234.45      
2   102A        23/01/2012  99.45       
3   104B        18/12/2011  767.63      
4   101A        9/12/2011   124.76      
5   104B        27/11/2011  346.87      
6   102A        18/11/2011  652.65      
7   104B        12/10/2011  765.21      
8   101A        1/10/2011   275.76      
9   102A        21/09/2011  532.21  

My target dataframe looks like this:

    customer_id date        invoice_amt days_since
0   101A        21/03/2012  654.76      49
1   101A        1/02/2012   234.45      54
2   102A        23/01/2012  99.45       66
3   104B        18/12/2011  767.63      21
4   101A        9/12/2011   124.76      69
5   104B        27/11/2011  346.87      46
6   102A        18/11/2011  652.65      58
7   104B        12/10/2011  765.21      NaN
8   101A        1/10/2011   275.76      NaN
9   102A        21/09/2011  532.21      NaN

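I have been able to calculate the days_since values within each grouped dataframe, but I'm not sure how to get those values back into the main dataframe (data_df).

Any help would be greatly appreciated. Thanks.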

import pandas as pd
#import numpy as np

#dataframe data note: no_days_since_last_purchase hard coded for testing purposes
my_data = {'customer_id' : ['101A', '101A', '102A', '104B', '101A', '104B', '102A', '104B', '101A', '102A' ],
          'date' : ['20120321','20120201','20120123','20111218','20111209','20111127','20111118','20111012','20111001','20110921'],
          'invoice_amt' : [654.76, 234.45, 99.45, 767.63, 124.76, 346.87, 652.65, 765.21, 275.76, 532.21 ],
          'no_days_since_last_purchase' : ['49', '54', '66', '21', '69', '46', '58', 'NaN', 'NaN', 'NaN']}

data_df = pd.DataFrame(my_data).sort_values(by='date', ascending=True)

#convert date str to date type
data_df['date'] = pd.to_datetime(data_df['date'].astype(str),format='%Y%m%d')

#group dataframe by customer_id  
grouped_data = data_df.groupby(['customer_id'])    

#for each row in each grouped dataframe calculate the difference in days between current and previous
#if there is no previous then use 2000-01-01 then convert to integer
for customer_id, group in grouped_data:
    group['days_since'] = (group['date'] - group['date'].shift().fillna(pd.Timestamp(2000, 1, 1))).dt.days
    print(group)

Output:

  customer_id       date  invoice_amt no_days_since_last_purchase  days_since
8        101A 2011-10-01       275.76                         NaN        4291
4        101A 2011-12-09       124.76                          69          69
1        101A 2012-02-01       234.45                          54          54
0        101A 2012-03-21       654.76                          49          49
  customer_id       date  invoice_amt no_days_since_last_purchase  days_since
9        102A 2011-09-21       532.21                         NaN        4281
6        102A 2011-11-18       652.65                          58          58
2        102A 2012-01-23        99.45                          66          66
  customer_id       date  invoice_amt no_days_since_last_purchase  days_since
7        104B 2011-10-12       765.21                         NaN        4302
5        104B 2011-11-27       346.87                          46          46
3        104B 2011-12-18       767.63                          21          21

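Oh, and I also get a SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

Any ideas on what I should do to avoid this warning would also be appreciated.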

【Comments】:

Tags: python pandas


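【Answer 1】:

Use transform to generate a Series whose index is aligned with your original df, which you can then assign as a new column. In addition, you can't use astype to convert datetime64[ns] to timedelta[D], so there is an additional step of calling to_timedelta: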

In [193]:
data_df['days_since'] = data_df.groupby(['customer_id'])['date'].transform(lambda x: x - x.shift().fillna(pd.Timestamp(2000, 1, 1)))
data_df['days_since'] = pd.to_timedelta(data_df['days_since'])
data_df

Out[193]:
  customer_id       date  invoice_amt no_days_since_last_purchase  days_since
9        102A 2011-09-21       532.21                         NaN   4281 days
8        101A 2011-10-01       275.76                         NaN   4291 days
7        104B 2011-10-12       765.21                         NaN   4302 days
6        102A 2011-11-18       652.65                          58     58 days
5        104B 2011-11-27       346.87                          46     46 days
4        101A 2011-12-09       124.76                          69     69 days
3        104B 2011-12-18       767.63                          21     21 days
2        102A 2012-01-23        99.45                          66     66 days
1        101A 2012-02-01       234.45                          54     54 days
0        101A 2012-03-21       654.76                          49     49 days
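
To see why the transform result can be assigned straight back as a column, here is a minimal sketch on a four-row subset of the question's data. The names `df`, `gaps`, and `same_index` are made up for illustration, and the first purchase per customer is left as missing instead of being filled with 2000-01-01.

```python
import pandas as pd

# Four rows from the question, keeping their original (non-sequential) index labels
df = pd.DataFrame(
    {
        "customer_id": ["101A", "102A", "101A", "102A"],
        "date": pd.to_datetime(["2011-10-01", "2011-09-21", "2011-12-09", "2011-11-18"]),
    },
    index=[8, 9, 4, 6],
)

# transform returns one value per input row, labelled with the original index
gaps = df.groupby("customer_id")["date"].transform(lambda s: s - s.shift())
gaps = pd.to_timedelta(gaps)  # same extra step as in the answer

# The index matches df, so plain column assignment lines up row by row
same_index = gaps.index.equals(df.index)
df["days_since"] = gaps.dt.days
print(df)
```

Because the result carries the same index labels as `df`, no merge or join is needed to get the per-group values back into the main frame.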

Edit

Actually, you can call to_timedelta directly on the returned Series, like this:

data_df['days_since'] = pd.to_timedelta(data_df.groupby(['customer_id'])['date'].transform(lambda x: x - x.shift().fillna(pd.Timestamp(2000, 1, 1))))
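
If the goal is the target frame from the question (day counts, with NaN for each customer's first purchase), one possible alternative is a grouped `diff()` followed by `.dt.days`. This is a different technique from the transform call above, not part of the original answer, and it assumes a reasonably recent pandas version.

```python
import pandas as pd

data_df = pd.DataFrame({
    "customer_id": ["101A", "101A", "102A", "104B", "101A",
                    "104B", "102A", "104B", "101A", "102A"],
    "date": ["20120321", "20120201", "20120123", "20111218", "20111209",
             "20111127", "20111118", "20111012", "20111001", "20110921"],
    "invoice_amt": [654.76, 234.45, 99.45, 767.63, 124.76,
                    346.87, 652.65, 765.21, 275.76, 532.21],
})
data_df["date"] = pd.to_datetime(data_df["date"], format="%Y%m%d")

# Sort by date so diff() compares each purchase with that customer's previous one
data_df = data_df.sort_values("date")

# diff() within each group yields a timedelta; .dt.days turns it into a day count
# (the first purchase per customer has no predecessor and stays NaN)
data_df["days_since"] = data_df.groupby("customer_id")["date"].diff().dt.days

# Restore the original row order
data_df = data_df.sort_index()
print(data_df)
```

Leaving the first purchase as NaN avoids the 2000-01-01 placeholder and the large 4281/4291/4302 values it produces.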

【Comments】:

    【Answer 2】:
    df_container = []
    for customer_id, group in grouped_data:
        group['days_since'] = (group['date'] - group['date'].shift().fillna(pd.Timestamp(2000, 1, 1))).dt.days
        df_container.append(group)
    
    data_df = pd.concat(df_container)
    

    Maybe this is what you want?

      customer_id       date  invoice_amt no_days_since_last_purchase  days_since
    8        101A 2011-10-01       275.76                         NaN        4291
    4        101A 2011-12-09       124.76                          69          69
    1        101A 2012-02-01       234.45                          54          54
    0        101A 2012-03-21       654.76                          49          49
    9        102A 2011-09-21       532.21                         NaN        4281
    6        102A 2011-11-18       652.65                          58          58
    2        102A 2012-01-23        99.45                          66          66
    7        104B 2011-10-12       765.21                         NaN        4302
    5        104B 2011-11-27       346.87                          46          46
    3        104B 2011-12-18       767.63                          21          21
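
On the SettingWithCopyWarning raised in the question: a common way to avoid it in a loop like this is to work on an explicit copy of each group before adding the column. The sketch below is an illustration under that assumption, using a small made-up subset of the data and the hypothetical names `pieces` and `result`; it is not part of the original answer.

```python
import pandas as pd

data_df = pd.DataFrame({
    "customer_id": ["101A", "102A", "101A", "102A"],
    "date": pd.to_datetime(["2012-03-21", "2012-01-23", "2012-02-01", "2011-11-18"]),
})

pieces = []
for customer_id, group in data_df.sort_values("date").groupby("customer_id"):
    group = group.copy()  # explicit copy, so the assignment below is unambiguous
    group["days_since"] = group["date"].diff().dt.days
    pieces.append(group)

# Stitch the groups back together and restore the original row order
result = pd.concat(pieces).sort_index()
print(result)
```

The final `sort_index()` puts the rows back in their original order, which the plain `pd.concat` of the groups does not do.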
    

    【Comments】:
