Perform a groupby and aggregation in Python Pandas

【Question title】: Perform a groupby and aggregation in Python Pandas
【Posted】: 2015-11-27 19:34:15
【Question】:

I have a dataframe that looks like this:

user    time15min             name                  is_purchase
A       2015-08-18 16:45:00   Words With Friends    0
A       2015-08-18 16:45:00   Clash of Clans        0
A       2015-08-18 16:45:00   Words With Friends    0
A       2015-08-18 16:45:00   Clash of Clans        1
A       2015-08-18 17:00:00   Sudoku                0
B       2015-08-18 17:00:00   Angry Birds           0
B       2015-08-18 17:00:00   Candy Crush           0
B       2015-08-18 17:00:00   Candy Crush           0
....

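The time15min column holds the 15-minute bucket in which the user played a game on their phone.

For each user and each time15min bucket, I need an aggregated dataframe with one column showing the most played game and another showing whether any in-app purchase was made during that period.

So the result would look like this: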

 user   time15min             name                  purchase_made
  A     2015-08-18 16:45:00   Clash of Clans        1
  A     2015-08-18 17:00:00   Sudoku                0
  B     2015-08-18 17:00:00   Candy Crush           0

If there is a tie, as in A's first bucket, we can simply take the tied game that comes first alphabetically (Clash of Clans in this case).

【Question comments】:

Tags: python python-2.7 pandas group-by pandasql

【Solution 1】:

You can apply the recipe from here:

import pandas as pd

## read in your data from the clipboard; columns are separated by 2+ spaces
df = pd.read_clipboard(sep=r'\s{2,}')

## parse the bucket timestamps into datetimes
df['time15min'] = pd.to_datetime(df['time15min'])

## set the index to time15min, so df2 has a DatetimeIndex
df2 = df.set_index('time15min')

## count the plays and total the purchases per user, 15-minute bucket and game
## (pd.Grouper(freq=...) replaces pd.TimeGrouper, which was removed from pandas)
df3 = df2.groupby(['user', pd.Grouper(freq='15min'), 'name']).agg(
    count=('is_purchase', 'size'), is_purchase=('is_purchase', 'sum'))

## find the label of the highest count in each (user, time15min) group;
## idxmax keeps the first maximum and names are sorted, so ties resolve alphabetically
idx = df3.groupby(level=[0, 1])['count'].idxmax()
df3_count = df3.loc[idx.tolist()]

df3_count

This gives the following result:

                                                count  is_purchase
user  time15min            name
A     2015-08-18 16:45:00  Clash of Clans           2            1
      2015-08-18 17:00:00  Sudoku                   1            0
B     2015-08-18 17:00:00  Candy Crush              2            0
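
Note that the is_purchase value in the result above is summed per game, so it only reflects purchases made in the winning game, whereas the question asks whether any purchase happened in the bucket. The following is a minimal sketch of one way to get the exact table from the question, not the answerer's method. It assumes time15min is already bucketed (so no time grouper is needed), and the names `keys`, `counts`, `top`, `purchases`, `result` and the `plays` column are illustrative choices.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "user": ["A", "A", "A", "A", "A", "B", "B", "B"],
    "time15min": pd.to_datetime(
        ["2015-08-18 16:45:00"] * 4 + ["2015-08-18 17:00:00"] * 4
    ),
    "name": ["Words With Friends", "Clash of Clans", "Words With Friends",
             "Clash of Clans", "Sudoku", "Angry Birds", "Candy Crush",
             "Candy Crush"],
    "is_purchase": [0, 0, 0, 1, 0, 0, 0, 0],
})

keys = ["user", "time15min"]

# Number of plays per (user, bucket, game).
counts = df.groupby(keys + ["name"]).size().reset_index(name="plays")

# Within each bucket, put the highest play count first and break ties
# alphabetically by name, then keep only the first row per bucket.
top = (
    counts.sort_values(keys + ["plays", "name"],
                       ascending=[True, True, False, True])
          .drop_duplicates(subset=keys)[keys + ["name"]]
)

# 1 if any row in the bucket is a purchase, regardless of the game.
purchases = (
    df.groupby(keys)["is_purchase"].max().reset_index(name="purchase_made")
)

result = top.merge(purchases, on=keys)
print(result)
```

On the sample data this yields Clash of Clans with purchase_made 1 for A at 16:45, Sudoku with 0 for A at 17:00, and Candy Crush with 0 for B at 17:00, which matches the desired output in the question. Taking the max over the whole bucket is what makes purchase_made independent of which game won.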

【Comments】:

How did you create df2 here? I don't quite follow the part where df2 is created from df.
Sorry, that was a typo: it should be df2 = df.set_index('time15min')
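
To illustrate the step asked about in the comments, here is a small sketch of what `set_index` does, using a two-row frame made up for this example. It moves the time15min column into the index, which becomes a DatetimeIndex once the column holds datetimes, and it leaves the original frame untouched.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({
    "user": ["A", "B"],
    "time15min": pd.to_datetime(["2015-08-18 16:45:00",
                                 "2015-08-18 17:00:00"]),
    "is_purchase": [1, 0],
})

# set_index returns a new frame; df itself is not modified.
df2 = df.set_index("time15min")

print(type(df2.index).__name__)  # DatetimeIndex
print(list(df2.columns))         # ['user', 'is_purchase']
```

A DatetimeIndex is what lets a time-based grouper bucket the rows by frequency without naming a column.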