【Question title】:How to get distinct count of keys along with other aggregations in pandas
【Posted】:2017-02-02 09:05:18
【Question】:

My dataframe (DF) looks like this

Customer_number Store_number   year   month   last_buying_date1  amount     
     1             20          2014    10      2015-10-07        100
     1             20          2014    10      2015-10-09        200
     2             20          2014    10      2015-10-20        100
     2             10          2014    10      2015-10-13        500

I want to get output like this

 year   month  sum_purchase count_purchases distinct customers 
 2014    10       900          4                  3

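How can I get this output using agg and groupby? I am currently using a two-step groupby, but I am struggling to get the distinct customers. Here is my approach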

#### Step 1 - Aggregating everything at Customer_number, Store_number level
aggregations = {
    'amount': 'sum',
    'last_buying_date1': 'count',
    }
# Column names must match the dataframe's capitalisation (Customer_number, Store_number)
grouped_at_Cust = DF.groupby(['Customer_number','Store_number','month','year']).agg(aggregations).reset_index()
grouped_at_Cust.columns = ['customer_number','store_number','month','year','total_purchase','num_purchase']


#### Step 2 - Aggregating at year, month level


aggregations = {
    'total_purchase': 'sum',
    'num_purchase': 'sum',
     size    # <- how do I include size here?
    }

Monthly_customers = grouped_at_Cust.groupby(['year','month']).agg(aggregations).reset_index()
Monthly_customers.columns = ['year','month','sum_purchase','count_purchase','distinct_customers']

My struggle is with the second step. How do I include size in the second aggregation step?

【Question comments】:

    Tags: python pandas group-by aggregate


    【Solution 1】:

    You can use groupby.agg and pass the function nunique to return the number of unique customer IDs in each group.

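    # purchase_amt is the purchase-amount column (named `amount` in the question's sample)
    df_grp = df.groupby(['year', 'month'], as_index=False) \
               .agg({'purchase_amt': ['sum', 'count'], 'Customer_number': ['nunique']})
    
    # Flatten the two-level column index, e.g. ('purchase_amt', 'sum') -> 'purchase_amt_sum'
    df_grp.columns = ['_'.join(col).rstrip('_') for col in df_grp.columns]
    
    df_grp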
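
    For reference, here is a self-contained sketch of the same idea on the question's sample rows. It assumes pandas 0.25 or later, uses named aggregation so no column flattening is needed, and uses the question's column name `amount`; the variable name `df_named` is only illustrative.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Customer_number": [1, 1, 2, 2],
    "Store_number": [20, 20, 20, 10],
    "year": [2014, 2014, 2014, 2014],
    "month": [10, 10, 10, 10],
    "last_buying_date1": pd.to_datetime(
        ["2015-10-07", "2015-10-09", "2015-10-20", "2015-10-13"]
    ),
    "amount": [100, 200, 100, 500],
})

# Named aggregation: output_name=(input_column, function).
df_named = (
    df.groupby(["year", "month"])
      .agg(
          sum_purchase=("amount", "sum"),
          count_purchase=("amount", "count"),
          distinct_customers=("Customer_number", "nunique"),
      )
      .reset_index()
)
print(df_named)
```

    Note that nunique on Customer_number alone counts 2 customers for these rows, not the 3 shown in the desired output, because it ignores Store_number.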
    


    In case you need to group them differently (omitting certain columns) when performing the groupby operations:

    df_grp_1 = df.groupby(['year', 'month']).agg({'purchase_amt':['sum','count']})       
    
    df_grp_2 = df.groupby(['Store_number', 'month', 'year'])['Customer_number'].agg('nunique')
    

    Take level 1 of the multi-index columns, which holds the names of the agg operations performed:

    df_grp_1.columns = df_grp_1.columns.get_level_values(1)
    

    Merge them back on the intersection of the columns used for grouping:

    df_grp = df_grp_1.reset_index().merge(df_grp_2.reset_index().drop(['Store_number'], 
                                          axis=1), on=['year', 'month'], how='outer')
    

    Rename the columns to the new names:

    d = {'sum': 'sum_purchase', 'count': 'count_purchase', 'nunique': 'distinct_customers'}  
    
    df_grp.columns = [d.get(x, x) for x in df_grp.columns]
    df_grp
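
    If a customer is meant to be a (Customer_number, Store_number) pair, one possible alternative that avoids concatenating the two columns is to drop duplicate pairs first and then count rows per month. This is only a sketch under that assumption, run on the question's sample rows; the names `keys`, `totals`, `pairs` and `result` are illustrative and not part of the answer above.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Customer_number": [1, 1, 2, 2],
    "Store_number": [20, 20, 20, 10],
    "year": [2014, 2014, 2014, 2014],
    "month": [10, 10, 10, 10],
    "amount": [100, 200, 100, 500],
})

keys = ["year", "month"]

# Totals per month, computed over every purchase row.
totals = df.groupby(keys).agg(
    sum_purchase=("amount", "sum"),
    count_purchase=("amount", "count"),
)

# Keep one row per distinct (customer, store) pair within each month,
# then count the remaining rows per month.
pairs = (
    df.drop_duplicates(keys + ["Customer_number", "Store_number"])
      .groupby(keys)
      .size()
      .rename("distinct_customers")
)

# Both results share the (year, month) index, so a plain join lines them up.
result = totals.join(pairs).reset_index()
print(result)
```

    On the sample rows this gives one row for 2014/10 with sum_purchase 900, count_purchase 4 and distinct_customers 3, matching the output requested in the question.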
    

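    【Discussion】:

    • Thanks @Nickil. But my customer is defined as the combination of customer_number and store_number. How can I combine them to do nunique?
    • Is the sum/count of purchase_amt computed without using store_number as one of the grouping keys? If so, you need to perform groupby twice for the different selections. See the edit.
    • Please see the updated example (the question has been edited). A customer is not just customer_number, but the combination of customer_number and store_number. So if I could concatenate customer_number and store number and apply your solution with nunique, that would work. But the concat causes other problems.
    • Do you mean summing all the unique entries across the Customer_number and Store_number columns: df.groupby(['Store_number'])['Customer_number'].agg(['nunique']).sum().to_frame().T
    • The SQL equivalent of what I am trying to achieve would be: Select month, year , sum(total_amount) as sum_purchase, sum(num_purchases) as count_purchase, count (*) as distinct_customers from (select customer_number , store_number, month, year, sum(amount) as total_amount , count(*) as num_purchase from original_table group by customer_number,store_number,month,year ) a group by month,year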
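
    The SQL in the last comment can be sketched in pandas as two chained groupby steps, with 'size' in the second step playing the role of count(*). This assumes pandas 0.25 or later for named aggregation and uses the question's sample rows; `per_customer` and `monthly` are illustrative names.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Customer_number": [1, 1, 2, 2],
    "Store_number": [20, 20, 20, 10],
    "year": [2014, 2014, 2014, 2014],
    "month": [10, 10, 10, 10],
    "amount": [100, 200, 100, 500],
})

# Inner query: one row per (customer, store, month, year).
per_customer = (
    df.groupby(["Customer_number", "Store_number", "month", "year"])
      .agg(total_amount=("amount", "sum"), num_purchase=("amount", "count"))
      .reset_index()
)

# Outer query: roll up to (year, month); 'size' counts the inner rows,
# i.e. the number of distinct (customer, store) pairs.
monthly = (
    per_customer.groupby(["year", "month"])
      .agg(
          sum_purchase=("total_amount", "sum"),
          count_purchase=("num_purchase", "sum"),
          distinct_customers=("num_purchase", "size"),
      )
      .reset_index()
)
print(monthly)
```

    For the sample rows the inner step yields three rows, so the outer step reports sum_purchase 900, count_purchase 4 and distinct_customers 3.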