Pandas: Find immediate following duplicate record and sum - Answers

【Question title】: Pandas: Find immediate following duplicate record and sum
【Posted】: 2021-07-30 18:44:29
【Question description】:

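I'm working with an online dataset in which users list items for sale and then remove them almost immediately. I want to identify only the entries that immediately follow the original entry for the same item.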

activityTime       customerId  itemId activityType  dollarValue
2000-01-01-10:23        101     p101        add        10.32   
2000-01-01-10:25        101     p102     remove        10.32
2000-01-03-11:45        101     p102        add        10.32
2000-01-04-11:46        101     c101        add        11.00
2000-01-03-09:32        300     c201        add        69.34
2000-01-03-13:33        300     c301        add        23.54
2000-01-04-15:12        300     c401        add        79.25
2000-01-04-15:16        300     c401     remove        79.25
2000-01-05-16:32        300     c401        add        79.25

The goal is to get the following records from the data above:

2000-01-01-10:25        101     p102     remove        10.32
2000-01-04-15:16        300     c401     remove        79.25

The remove column isn't trustworthy, so these are the steps I took:

dups = df[df.duplicated(['customerId', 
                           'itemId', 
                           'dollarValue'], 
                          keep=False)]
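
As a side note on what this first step selects, here is a minimal sketch of how `keep=False` differs from the default. The three-row frame is hypothetical and only shaped like the data in the question.

```python
import pandas as pd

# Hypothetical three-row frame, shaped like the question's data.
df = pd.DataFrame({
    "customerId": [300, 300, 300],
    "itemId": ["c301", "c401", "c401"],
    "activityType": ["add", "add", "remove"],
    "dollarValue": [23.54, 79.25, 79.25],
})

keys = ["customerId", "itemId", "dollarValue"]

# keep=False flags every member of a duplicate group, not just the repeats.
mask_all = df.duplicated(keys, keep=False)
# The default keep="first" leaves the first occurrence unflagged.
mask_first = df.duplicated(keys)

print(mask_all.tolist())    # [False, True, True]
print(mask_first.tolist())  # [False, False, True]
```

So `dups` holds both the add and the remove for a repeated item, but it carries no information about which row came first.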

Then I split by activityType:

df_add = dups[dups.activityType == 'add']
df_remove = dups[dups.activityType == 'remove']

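Then I merged on these keys, assuming the rows would line up correctly, but they don't: a remove also ends up paired with an add of the same item that happened after the removal.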

df_add_remove = pd.merge(
    df_add, df_remove, 
    on=['customerId', 'itemId', 'dollarValue'], 
    how='inner'
).filter(['customerId',
          'activityTime_x', 
          'activityType_x',
          'activityTime_y', 
          'activityType_y', 
          'dollarValue']).rename(
    columns={'activityTime_x':'addDateTime',
             'activityTime_y':'removeDateTime'}
)
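
The misalignment described above can be reproduced with a small sketch. It uses only customer 300's three c401 rows from the sample data; the variable names and the `suffixes` values are illustrative choices, not part of the original code.

```python
import pandas as pd

# Customer 300's c401 rows from the sample data above.
df = pd.DataFrame({
    "activityTime": ["2000-01-04-15:12", "2000-01-04-15:16", "2000-01-05-16:32"],
    "customerId": [300, 300, 300],
    "itemId": ["c401", "c401", "c401"],
    "activityType": ["add", "remove", "add"],
    "dollarValue": [79.25, 79.25, 79.25],
})

adds = df[df.activityType == "add"]
removes = df[df.activityType == "remove"]

# An inner merge on the keys pairs EVERY add with EVERY remove sharing them,
# regardless of which one happened first.
pairs = adds.merge(
    removes,
    on=["customerId", "itemId", "dollarValue"],
    suffixes=("_add", "_remove"),
)

print(pairs[["activityTime_add", "activityTime_remove"]])
```

The result has two rows: the intended pair (add at 15:12, remove at 15:16) and an unwanted one where the add from 2000-01-05 is matched to the earlier remove. The merge keys alone cannot express "the row right before".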

The remove should appear immediately, within a few minutes of the item being added. The customer may add the same item again later.

Merging doesn't look like a good approach. What is the most pythonic/pandas way to accomplish this task?

【Question discussion】:

Tags: python python-3.x pandas dataframe

【Solution 1】:

You can try this:

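import pandas as pd

# The sample timestamps look like '2000-01-01-10:23', so the format is passed explicitly rather than inferred.
df['activityTime'] = pd.to_datetime(df['activityTime'], format='%Y-%m-%d-%H:%M')
df = df.sort_values(['customerId', 'itemId', 'activityTime'])

def filter_product(x):
    # Only groups that contain a 'remove' can yield results.
    if 'remove' in x['activityType'].values:
        x = x.copy()  # avoid modifying the group in place
        # Seconds elapsed since the previous row of the same customer and item.
        x['diff_in_sec'] = (x['activityTime'] - x['activityTime'].shift(1)).dt.total_seconds()
        return x[(x['activityType'] == 'remove') & (x['diff_in_sec'] < 600)]

removed_df = df.groupby(['customerId', 'itemId']).apply(filter_product).reset_index(drop=True)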
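
The same idea can also be written without `apply`, by shifting within each group. This is a sketch of an alternative, not the answer's method. It assumes that "within a few minutes" means under 10 minutes (mirroring the 600 seconds above) and additionally requires the previous row to be an add. The data is customer 300's rows from the question.

```python
import pandas as pd

# Customer 300's rows from the question's sample data.
df = pd.DataFrame({
    "activityTime": ["2000-01-03-09:32", "2000-01-03-13:33", "2000-01-04-15:12",
                     "2000-01-04-15:16", "2000-01-05-16:32"],
    "customerId": [300] * 5,
    "itemId": ["c201", "c301", "c401", "c401", "c401"],
    "activityType": ["add", "add", "add", "remove", "add"],
    "dollarValue": [69.34, 23.54, 79.25, 79.25, 79.25],
})
df["activityTime"] = pd.to_datetime(df["activityTime"], format="%Y-%m-%d-%H:%M")
df = df.sort_values(["customerId", "itemId", "activityTime"])

# Look at the previous event for the same customer and item.
grouped = df.groupby(["customerId", "itemId"])
prev_type = grouped["activityType"].shift(1)
prev_time = grouped["activityTime"].shift(1)

# Assumed threshold: "within a few minutes" is taken to mean under 10 minutes.
window = pd.Timedelta(minutes=10)

quick_removes = df[
    (df["activityType"] == "remove")
    & (prev_type == "add")
    & (df["activityTime"] - prev_time < window)
]
print(quick_removes)
```

On this subset the only row kept is the c401 remove at 2000-01-04 15:16. The first row of each group has no predecessor, so its comparisons evaluate to False and it is never selected.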

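【Discussion】:

If you're interested, there is a slightly different question: stackoverflow.com/questions/67459880/…
I'm seeing negative values for time_dff. Should shift(1) be the record after the current one?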
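
On the direction of `shift`, a small sketch may help. The three timestamps are the c401 times from the question. That negative differences come from rows not being sorted by time is only one plausible explanation, since the commenter's data isn't shown.

```python
import pandas as pd

times = pd.Series(pd.to_datetime(
    ["2000-01-04 15:12", "2000-01-04 15:16", "2000-01-05 16:32"]
))

# shift(1) moves values DOWN one row, so each row sees the PREVIOUS value.
since_prev = (times - times.shift(1)).dt.total_seconds()
# shift(-1) moves values UP one row, so each row sees the NEXT value.
until_next = (times.shift(-1) - times).dt.total_seconds()

print(since_prev.tolist())  # [nan, 240.0, 90960.0]
print(until_next.tolist())  # [240.0, 90960.0, nan]

# If the rows are NOT sorted by time, the same shift(1) difference turns negative.
unsorted = times.iloc[::-1].reset_index(drop=True)
unsorted_diff = (unsorted - unsorted.shift(1)).dt.total_seconds()
print(unsorted_diff.tolist())  # [nan, -90960.0, -240.0]
```

With sorted data, `shift(1)` compares each row with the one before it and the differences are non-negative, so checking the sort order within each group is a reasonable first step when negative values appear.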