【Question Title】: Pandas Groupby and cumsum with multiple conditions and columns - Python
【Posted】: 2020-10-22 23:55:42
【Question】:

I have the following DataFrame:

import pandas as pd

hits = {'id': ['A','A','A','A','A','A','B','B','B','C','C','C'],
        'datetime': ['2010-01-02 03:00:00','2010-01-02 03:00:14','2010-01-02 03:00:35','2010-01-02 03:00:38',
                    '2010-01-02 03:29:10','2010-01-02 03:29:35','2010-01-02 03:45:20','2010-01-02 06:10:05',
                    '2010-01-02 06:10:15','2010-01-02 07:40:15','2010-01-02 07:40:20','2010-01-02 07:40:25'],
        'uri_len': [10,20,25,15,20,10,20,25,15,30,40,45]
       }

df = pd.DataFrame(hits, columns = ['id', 'datetime','uri_len'])

df['datetime'] =  pd.to_datetime(df['datetime'], format='%Y-%m-%d %H:%M:%S')

print (df)

   id            datetime  uri_len
0   A 2010-01-02 03:00:00       10
1   A 2010-01-02 03:00:14       20
2   A 2010-01-02 03:00:35       25
3   A 2010-01-02 03:00:38       15
4   A 2010-01-02 03:29:10       20
5   A 2010-01-02 03:29:35       10
6   B 2010-01-02 03:45:20       20
7   B 2010-01-02 06:10:05       25
8   B 2010-01-02 06:10:15       15
9   C 2010-01-02 07:40:15       30
10  C 2010-01-02 07:40:20       40
11  C 2010-01-02 07:40:25       45

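I want to group the hits into sessions, using id as the grouping variable. For me, a new session starts either after more than 15 seconds of inactivity (computed from the datetime column) or when uri_len decreases. In both cases the comparison is between consecutive hits.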

I know how to group by each condition separately:

df['session1'] = (df.groupby('id')['datetime']
               .transform(lambda x: x.diff().gt('15Sec').cumsum())
              )

df['session2'] = (df.groupby('id')['uri_len']
               .transform(lambda x: x.diff().lt(0).cumsum())
              ) 

Is there a way to combine both transformations in a single statement, so that the output looks directly like this?

   id            datetime  uri_len  session
0   A 2010-01-02 03:00:00       10        0
1   A 2010-01-02 03:00:14       20        0
2   A 2010-01-02 03:00:35       25        1
3   A 2010-01-02 03:00:38       15        2
4   A 2010-01-02 03:29:10       20        3
5   A 2010-01-02 03:29:35       10        4
6   B 2010-01-02 03:45:20       20        0
7   B 2010-01-02 06:10:05       25        1
8   B 2010-01-02 06:10:15       15        2
9   C 2010-01-02 07:40:15       30        0
10  C 2010-01-02 07:40:20       40        0
11  C 2010-01-02 07:40:25       45        0

【Comments】:

    Tags: python pandas pandas-groupby cumsum


    【Solution 1】:

    If I understand correctly, you want to add them?

    df['session'] = (
        df.groupby('id')['datetime'].transform(lambda x: x.diff().gt('15Sec').cumsum())
        + df.groupby('id')['uri_len'].transform(lambda x: x.diff().lt(0).cumsum())
    )
    

    A clearer way:

    s1 = df.groupby('id')['datetime'].transform(lambda x: x.diff().gt('15Sec').cumsum())
    
    s2 = df.groupby('id')['uri_len'].transform(lambda x: x.diff().lt(0).cumsum())
    
    df['session'] = s1+s2
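
One caveat with summing the two counters: a hit that meets both conditions at once is counted twice. In the sample data this happens at row 5, which comes 25 seconds after the previous hit and also drops uri_len from 20 to 10, so the sum gives 5 there rather than the 4 shown in the desired output. A minimal sketch of an alternative, assuming a hit should open only one new session no matter how many conditions it meets, is to OR the two boolean flags before taking the per-id cumulative sum. The names `gap` and `drop` are illustrative, and `pd.Timedelta(seconds=15)` is used here in place of the `'15Sec'` string.

```python
import pandas as pd

# Sample data from the question.
hits = {
    'id': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'datetime': [
        '2010-01-02 03:00:00', '2010-01-02 03:00:14', '2010-01-02 03:00:35',
        '2010-01-02 03:00:38', '2010-01-02 03:29:10', '2010-01-02 03:29:35',
        '2010-01-02 03:45:20', '2010-01-02 06:10:05', '2010-01-02 06:10:15',
        '2010-01-02 07:40:15', '2010-01-02 07:40:20', '2010-01-02 07:40:25',
    ],
    'uri_len': [10, 20, 25, 15, 20, 10, 20, 25, 15, 30, 40, 45],
}
df = pd.DataFrame(hits)
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %H:%M:%S')

# Per-id difference between consecutive hits; the first hit of each id
# has no predecessor (NaT / NaN), so both comparisons are False there.
gap = df.groupby('id')['datetime'].diff().gt(pd.Timedelta(seconds=15))
drop = df.groupby('id')['uri_len'].diff().lt(0)

# A hit starts a new session if either condition holds; count the starts per id.
new_session = (gap | drop).astype(int)
df['session'] = new_session.groupby(df['id']).cumsum()

print(df)
```

On the sample data this yields sessions 0, 0, 1, 2, 3, 4 for id A, 0, 1, 2 for B, and 0, 0, 0 for C, which matches the table in the question.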
    

    【Comments】:
