【Question title】: Python removing rows with time condition
【Posted】: 2023-03-24 11:33:01
【Question description】:

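I have two DataFrames. Each has a unique identifier and datetime data in the following format:

"2020-01-01 00:00:01" for the datetime, "12345" for the unique identifier, plus a type.

First question, DF1: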

   DatetimeX            ID    Type
   2020-01-01 02:00:01 12345 C
   2020-01-01 02:00:03 12345 C
   2020-01-01 05:00:03 12345 C
   2020-01-01 05:03:05 12345 C
   2020-01-01 03:00:09 13333 D
   2020-01-01 02:00:09 12345 C
   2020-01-01 02:01:35 12345 C
   2020-01-01 02:10:35 12345 C
   2020-01-01 02:00:01 13333 D
   2020-01-01 02:05:35 13333 D
   2020-01-01 02:00:50 13333 E
   2020-01-01 02:00:01 12211 C
   2020-01-01 02:09:50 13333 E
   2020-01-01 02:11:50 13333 E

For each ID with the same "Type", I want to keep the first timestamp and remove the rows that fall within the 10 minutes after it:

   DatetimeX            ID    Type
   2020-01-01 02:00:01 12345 C
   2020-01-01 05:00:03 12345 C
   2020-01-01 02:10:35 12345 C
   2020-01-01 03:00:09 13333 D
   2020-01-01 02:00:01 13333 D
   2020-01-01 02:00:50 13333 E
   2020-01-01 02:00:01 12211 C
   2020-01-01 02:11:50 13333 E

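I tried exploring time ranges/date ranges but could not find any similar coding concept. I would appreciate it if someone could point out what I could look into, rather than handing me a complete solution. I have not touched Python for a few years and was not very familiar with it before. Thanks.

Updated with additional data rows for a more accurate example.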

【Question discussion】:

    Tags: python pandas dataframe datetime spyder


    【Solution 1】:

    Sample input data, followed by a simplified version of the procedure:

    import pandas as pd

    Timestamp = pd.to_datetime
    data = [{'DatetimeX': Timestamp('2020-01-01 02:00:01'), 'ID': 12345, 'Type': 'C'},
     {'DatetimeX': Timestamp('2020-01-01 02:00:03'), 'ID': 12345, 'Type': 'C'},
     {'DatetimeX': Timestamp('2020-01-01 05:00:03'), 'ID': 12345, 'Type': 'C'},
     {'DatetimeX': Timestamp('2020-01-01 05:03:05'), 'ID': 12345, 'Type': 'C'},
     {'DatetimeX': Timestamp('2020-01-01 03:00:09'), 'ID': 13333, 'Type': 'D'},
     {'DatetimeX': Timestamp('2020-01-01 02:00:09'), 'ID': 12345, 'Type': 'C'},
     {'DatetimeX': Timestamp('2020-01-01 02:01:35'), 'ID': 12345, 'Type': 'C'},
     {'DatetimeX': Timestamp('2020-01-01 02:10:35'), 'ID': 12345, 'Type': 'C'},
     {'DatetimeX': Timestamp('2020-01-01 02:00:01'), 'ID': 13333, 'Type': 'D'},
     {'DatetimeX': Timestamp('2020-01-01 02:05:35'), 'ID': 13333, 'Type': 'D'},
     {'DatetimeX': Timestamp('2020-01-01 02:00:50'), 'ID': 13333, 'Type': 'E'},
     {'DatetimeX': Timestamp('2020-01-01 02:00:01'), 'ID': 12211, 'Type': 'C'},
     {'DatetimeX': Timestamp('2020-01-01 02:09:50'), 'ID': 13333, 'Type': 'E'},
     {'DatetimeX': Timestamp('2020-01-01 02:11:50'), 'ID': 13333, 'Type': 'E'}]
    df1 = pd.DataFrame(data)
    
    
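    col_raw = df1.columns
    ten_min = 10 * 60  # window length in seconds
    while True:
        df1.sort_values(['ID', 'Type', 'DatetimeX'], inplace=True)
        # gap to the previous row of the same (ID, Type);
        # total_seconds() keeps the days part, .dt.seconds would drop it
        gap = df1.groupby(['ID', 'Type'])['DatetimeX'].diff().dt.total_seconds()
        df1['diff1_lt10min'] = gap < ten_min
        # a new tag starts at every row whose gap is missing or >= 10 minutes
        df1['tag_group'] = (~df1['diff1_lt10min']).cumsum()
        # stop once every tag holds exactly one row
        if df1.duplicated('tag_group').sum()==0:
            break
        df1 = df1.merge((df1.groupby('tag_group')['DatetimeX'].first()
                   .reset_index()
                   .rename(columns={'DatetimeX':'DatetimeX_1st'})),
                  on='tag_group')
        df1['diff2_lt10min'] = (df1.DatetimeX - df1.DatetimeX_1st).dt.total_seconds() < ten_min
        # drop rows within 10 minutes of both the previous row and the tag's first row
        cond = df1['diff1_lt10min'] & df1['diff2_lt10min']
        df1 = df1.loc[~cond, col_raw].copy()
    df1 = df1[col_raw]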
    

    Details...

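    import numpy as np

    # the same loop, step by step
    # assumes df1 is the unfiltered frame, i.e. pd.DataFrame(data) from above
    col_raw = df1.columns
    df4 = df1.copy()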
    n_round = 1
    while True:
        print('#'*20, f'round {n_round}', '#'*20)
        # step 1: sort the values, then compute the DatetimeX gap to the previous row within each ['Type', 'ID'] group
        # note: this row-to-row gap is not yet the quantity we want (time since the first row of the window)
        df = df4[col_raw].copy()
        df.sort_values(['ID', 'Type', 'DatetimeX'], inplace=True)
        df['diff'] = df.groupby(['Type', 'ID'])['DatetimeX'].diff()
        print('#'*10, 'step1', '#'*10)
        print(df)
    
        # step 2: create a 'tag' column; a new tag starts where 'diff' is missing or >= 10 min
        cond = df['diff'].isnull()
        # total_seconds() includes the days part, unlike .dt.seconds
        cond |= df['diff'].dt.total_seconds() >= 10 * 60
        df['tag'] = np.where(cond, 1, 0)
        df['tag'] = df['tag'].cumsum()
        print('#'*10, 'step2', '#'*10)
        print(df)
    
        # step 3, use 'tag' to judge to stop the while loop or not
        # tag should be unique
        break_sign = df.tag.duplicated().sum()
        if break_sign == 0:
            break
        print('#'*10, 'step3', '#'*10)
        print(break_sign)
        
        # step 4:
            # create a 'DatetimeX_1st' with the 'tag' group's first DatetimeX
            # create a 'diff2' = 'DatetimeX' - 'DatetimeX_1st'
        df2 = df.reset_index().set_index('tag')
        df2['DatetimeX_1st'] = df.groupby('tag').first()['DatetimeX']
        df2['diff2'] = df2['DatetimeX'] - df2['DatetimeX_1st']
        print('#'*10, 'step4', '#'*10)
        print(df2)
        
        # step 5:
            # drop the records that really fall inside a 10-minute window:
            # both 'diff' and 'diff2' must be < 10 min
        cond = (df2['diff2'].dt.total_seconds() < 10 * 60) & (df2['diff'].dt.total_seconds() < 10 * 60)
        df3 = df2[~cond].copy()
        print('#'*10, 'step5', '#'*10)
        print(df3)
        
        
        # step 6:
            # reset index
        cols = 'tag DatetimeX   ID  Type'.split()
        df4 = df3.reset_index().set_index('index').sort_index()[cols]
        print('#'*10, 'step6', '#'*10)
        print(df4)
        
        n_round += 1
        print()
        
    # get result
    result = df[['DatetimeX', 'ID', 'Type']].copy()
    result.index.name = None
    print()
    print('#'*10, 'result', '#'*10)
    print(result)
    

    Output:

    #################### round 1 ####################
    ########## step1 ##########
                 DatetimeX     ID Type            diff
    11 2020-01-01 02:00:01  12211    C             NaT
    0  2020-01-01 02:00:01  12345    C             NaT
    1  2020-01-01 02:00:03  12345    C 0 days 00:00:02
    5  2020-01-01 02:00:09  12345    C 0 days 00:00:06
    6  2020-01-01 02:01:35  12345    C 0 days 00:01:26
    7  2020-01-01 02:10:35  12345    C 0 days 00:09:00
    2  2020-01-01 05:00:03  12345    C 0 days 02:49:28
    3  2020-01-01 05:03:05  12345    C 0 days 00:03:02
    8  2020-01-01 02:00:01  13333    D             NaT
    9  2020-01-01 02:05:35  13333    D 0 days 00:05:34
    4  2020-01-01 03:00:09  13333    D 0 days 00:54:34
    10 2020-01-01 02:00:50  13333    E             NaT
    12 2020-01-01 02:09:50  13333    E 0 days 00:09:00
    13 2020-01-01 02:11:50  13333    E 0 days 00:02:00
    ########## step2 ##########
                 DatetimeX     ID Type            diff  tag
    11 2020-01-01 02:00:01  12211    C             NaT    1
    0  2020-01-01 02:00:01  12345    C             NaT    2
    1  2020-01-01 02:00:03  12345    C 0 days 00:00:02    2
    5  2020-01-01 02:00:09  12345    C 0 days 00:00:06    2
    6  2020-01-01 02:01:35  12345    C 0 days 00:01:26    2
    7  2020-01-01 02:10:35  12345    C 0 days 00:09:00    2
    2  2020-01-01 05:00:03  12345    C 0 days 02:49:28    3
    3  2020-01-01 05:03:05  12345    C 0 days 00:03:02    3
    8  2020-01-01 02:00:01  13333    D             NaT    4
    9  2020-01-01 02:05:35  13333    D 0 days 00:05:34    4
    4  2020-01-01 03:00:09  13333    D 0 days 00:54:34    5
    10 2020-01-01 02:00:50  13333    E             NaT    6
    12 2020-01-01 02:09:50  13333    E 0 days 00:09:00    6
    13 2020-01-01 02:11:50  13333    E 0 days 00:02:00    6
    ########## step3 ##########
    8
    ########## step4 ##########
         index           DatetimeX     ID Type            diff  \
    tag                                                          
    1       11 2020-01-01 02:00:01  12211    C             NaT   
    2        0 2020-01-01 02:00:01  12345    C             NaT   
    2        1 2020-01-01 02:00:03  12345    C 0 days 00:00:02   
    2        5 2020-01-01 02:00:09  12345    C 0 days 00:00:06   
    2        6 2020-01-01 02:01:35  12345    C 0 days 00:01:26   
    2        7 2020-01-01 02:10:35  12345    C 0 days 00:09:00   
    3        2 2020-01-01 05:00:03  12345    C 0 days 02:49:28   
    3        3 2020-01-01 05:03:05  12345    C 0 days 00:03:02   
    4        8 2020-01-01 02:00:01  13333    D             NaT   
    4        9 2020-01-01 02:05:35  13333    D 0 days 00:05:34   
    5        4 2020-01-01 03:00:09  13333    D 0 days 00:54:34   
    6       10 2020-01-01 02:00:50  13333    E             NaT   
    6       12 2020-01-01 02:09:50  13333    E 0 days 00:09:00   
    6       13 2020-01-01 02:11:50  13333    E 0 days 00:02:00   
    
              DatetimeX_1st           diff2  
    tag                                      
    1   2020-01-01 02:00:01 0 days 00:00:00  
    2   2020-01-01 02:00:01 0 days 00:00:00  
    2   2020-01-01 02:00:01 0 days 00:00:02  
    2   2020-01-01 02:00:01 0 days 00:00:08  
    2   2020-01-01 02:00:01 0 days 00:01:34  
    2   2020-01-01 02:00:01 0 days 00:10:34  
    3   2020-01-01 05:00:03 0 days 00:00:00  
    3   2020-01-01 05:00:03 0 days 00:03:02  
    4   2020-01-01 02:00:01 0 days 00:00:00  
    4   2020-01-01 02:00:01 0 days 00:05:34  
    5   2020-01-01 03:00:09 0 days 00:00:00  
    6   2020-01-01 02:00:50 0 days 00:00:00  
    6   2020-01-01 02:00:50 0 days 00:09:00  
    6   2020-01-01 02:00:50 0 days 00:11:00  
    ########## step5 ##########
         index           DatetimeX     ID Type            diff  \
    tag                                                          
    1       11 2020-01-01 02:00:01  12211    C             NaT   
    2        0 2020-01-01 02:00:01  12345    C             NaT   
    2        7 2020-01-01 02:10:35  12345    C 0 days 00:09:00   
    3        2 2020-01-01 05:00:03  12345    C 0 days 02:49:28   
    4        8 2020-01-01 02:00:01  13333    D             NaT   
    5        4 2020-01-01 03:00:09  13333    D 0 days 00:54:34   
    6       10 2020-01-01 02:00:50  13333    E             NaT   
    6       13 2020-01-01 02:11:50  13333    E 0 days 00:02:00   
    
              DatetimeX_1st           diff2  
    tag                                      
    1   2020-01-01 02:00:01 0 days 00:00:00  
    2   2020-01-01 02:00:01 0 days 00:00:00  
    2   2020-01-01 02:00:01 0 days 00:10:34  
    3   2020-01-01 05:00:03 0 days 00:00:00  
    4   2020-01-01 02:00:01 0 days 00:00:00  
    5   2020-01-01 03:00:09 0 days 00:00:00  
    6   2020-01-01 02:00:50 0 days 00:00:00  
    6   2020-01-01 02:00:50 0 days 00:11:00  
    ########## step6 ##########
           tag           DatetimeX     ID Type
    index                                     
    0        2 2020-01-01 02:00:01  12345    C
    2        3 2020-01-01 05:00:03  12345    C
    4        5 2020-01-01 03:00:09  13333    D
    7        2 2020-01-01 02:10:35  12345    C
    8        4 2020-01-01 02:00:01  13333    D
    10       6 2020-01-01 02:00:50  13333    E
    11       1 2020-01-01 02:00:01  12211    C
    13       6 2020-01-01 02:11:50  13333    E
    
    #################### round 2 ####################
    ########## step1 ##########
                    DatetimeX     ID Type            diff
    index                                                
    11    2020-01-01 02:00:01  12211    C             NaT
    0     2020-01-01 02:00:01  12345    C             NaT
    7     2020-01-01 02:10:35  12345    C 0 days 00:10:34
    2     2020-01-01 05:00:03  12345    C 0 days 02:49:28
    8     2020-01-01 02:00:01  13333    D             NaT
    4     2020-01-01 03:00:09  13333    D 0 days 01:00:08
    10    2020-01-01 02:00:50  13333    E             NaT
    13    2020-01-01 02:11:50  13333    E 0 days 00:11:00
    ########## step2 ##########
                    DatetimeX     ID Type            diff  tag
    index                                                     
    11    2020-01-01 02:00:01  12211    C             NaT    1
    0     2020-01-01 02:00:01  12345    C             NaT    2
    7     2020-01-01 02:10:35  12345    C 0 days 00:10:34    3
    2     2020-01-01 05:00:03  12345    C 0 days 02:49:28    4
    8     2020-01-01 02:00:01  13333    D             NaT    5
    4     2020-01-01 03:00:09  13333    D 0 days 01:00:08    6
    10    2020-01-01 02:00:50  13333    E             NaT    7
    13    2020-01-01 02:11:50  13333    E 0 days 00:11:00    8
    
    ########## result ##########
                 DatetimeX     ID Type
    11 2020-01-01 02:00:01  12211    C
    0  2020-01-01 02:00:01  12345    C
    7  2020-01-01 02:10:35  12345    C
    2  2020-01-01 05:00:03  12345    C
    8  2020-01-01 02:00:01  13333    D
    4  2020-01-01 03:00:09  13333    D
    10 2020-01-01 02:00:50  13333    E
    13 2020-01-01 02:11:50  13333    E
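The rule that the loop above converges to can also be written as one pass over the rows in time order. The following is only a sketch under assumptions: the helper name `drop_within_window` and the dictionary of per-key anchors are illustrative and not part of the original answer, and the frame is assumed to have a unique, ascending index such as the default one. For every (ID, Type) pair it remembers the last kept timestamp and keeps a row only when at least 10 minutes have passed since then.

```python
import pandas as pd

rows = [
    ("2020-01-01 02:00:01", 12345, "C"), ("2020-01-01 02:00:03", 12345, "C"),
    ("2020-01-01 05:00:03", 12345, "C"), ("2020-01-01 05:03:05", 12345, "C"),
    ("2020-01-01 03:00:09", 13333, "D"), ("2020-01-01 02:00:09", 12345, "C"),
    ("2020-01-01 02:01:35", 12345, "C"), ("2020-01-01 02:10:35", 12345, "C"),
    ("2020-01-01 02:00:01", 13333, "D"), ("2020-01-01 02:05:35", 13333, "D"),
    ("2020-01-01 02:00:50", 13333, "E"), ("2020-01-01 02:00:01", 12211, "C"),
    ("2020-01-01 02:09:50", 13333, "E"), ("2020-01-01 02:11:50", 13333, "E"),
]
df = pd.DataFrame(rows, columns=["DatetimeX", "ID", "Type"])
df["DatetimeX"] = pd.to_datetime(df["DatetimeX"])


def drop_within_window(frame, window=pd.Timedelta(minutes=10)):
    """Hypothetical helper: keep a row only if it is the first of its
    (ID, Type) pair or lies at least `window` after the last kept row."""
    ordered = frame.sort_values("DatetimeX")
    anchors = {}  # (ID, Type) -> timestamp of the last kept row
    keep = []
    keys = zip(ordered["ID"], ordered["Type"])
    for ts, key in zip(ordered["DatetimeX"], keys):
        last = anchors.get(key)
        if last is None or ts - last >= window:
            anchors[key] = ts  # this row starts a new window
            keep.append(True)
        else:
            keep.append(False)
    mask = pd.Series(keep, index=ordered.index)
    # sort_index() restores the original row order
    return ordered[mask].sort_index()


result = drop_within_window(df)
print(result.index.tolist())  # [0, 2, 4, 7, 8, 10, 11, 13]
```

On the sample data this keeps the same eight rows as the result printed above. A plain Python loop is slower than vectorised pandas on large frames, but a few thousand rows should not be a problem.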
    

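    【Discussion】:

    • Thanks for breaking it down step by step. I am trying the approach and hit an error at step 3. The error message is "None of ['tag'] are in the columns". Any idea how I can fix this? I tried "df6 = df.reset_index() df6.set_index('tag')" but ran into a length mismatch error
    • Sorry, I meant step 4 @Ferris
    • Sorry, my mistake, a typo in my code
    • Hi, sorry, it was working fine, but I accidentally pulled the power plug and my code is gone T^T. I can't verify it
    • Hi, I managed to recover my code and started trying it. I realized that before step 1, at df = df4[col_raw].copy(), it actually removes some rows. For example, I have data for ID 12345 on 5 different dates, and after running the while loop, 2 of the dates were removed. I have been trying to figure out why this happens but can't seem to find it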
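A likely explanation for rows on other dates disappearing, offered as an assumption since the commenter's data is not shown: `.dt.seconds` returns only the seconds component of a timedelta (0 to 86399) and ignores whole days, so two rows a day apart can look only a few seconds apart. `.dt.total_seconds()` returns the full duration. A minimal sketch with made-up timestamps:

```python
import pandas as pd

# two timestamps one day and four seconds apart (illustrative values)
stamps = pd.Series(pd.to_datetime(["2020-01-01 02:00:01", "2020-01-02 02:00:05"]))
gap = stamps.diff()  # NaT, then 1 day 00:00:04

# seconds component only: the whole day is ignored
naive = gap.dt.seconds < 10 * 60
# full duration in seconds: 86404 for the second row
safe = gap.dt.total_seconds() < 10 * 60

print(naive.tolist())  # [False, True]  -> row wrongly treated as "within 10 minutes"
print(safe.tolist())   # [False, False]
```

This is why the interval tests in the loop should compare `total_seconds()` (or compare the timedelta directly against `pd.Timedelta(minutes=10)`) when the data spans more than one day.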
    【Solution 2】:

    IIUC, you should try groupby:

    >>> df.groupby((df.Type != df.Type.shift()).cumsum(), as_index=False).first()
                DatetimeX     ID Type
    0 2020-01-01 02:00:01  12345    C
    1 2020-01-01 02:00:01  13333    D
    2 2020-01-01 02:00:50  13333    E
    3 2020-01-01 02:00:01  12211    C
    >>> 
    

    It groups by runs of consecutive identical values.
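To make the run-grouping idiom concrete, here is a small sketch on made-up rows (the data and the name `run_id` are illustrative, not from the answer). Comparing a column with its shifted copy marks the rows where the value changes, and the cumulative sum turns those marks into one label per run. The timestamps play no part in the grouping, so rows hours apart fall into the same run whenever their Type matches.

```python
import pandas as pd

df = pd.DataFrame({
    "DatetimeX": pd.to_datetime([
        "2020-01-01 02:00:01", "2020-01-01 05:00:03",
        "2020-01-01 02:00:01", "2020-01-01 02:00:50",
    ]),
    "ID": [12345, 12345, 13333, 13333],
    "Type": ["C", "C", "D", "E"],
})

# True wherever Type differs from the previous row; cumsum labels each run
run_id = (df["Type"] != df["Type"].shift()).cumsum().rename("run")
print(run_id.tolist())  # [1, 1, 2, 3]

# first row of each run; the 05:00:03 row is dropped although it is 3 hours later
firsts = df.groupby(run_id).first()
print(firsts)
```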

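    【Discussion】:

    • Hello! Thanks for the guidance. I tried the code and it works, but it seems to remove rows that are 3 hours apart in my own dataset. It needs adjusting for the 10-minute window: any data within 10 minutes after the first timestamp should be removed
    【Solution 3】:

    Based on your statement "I want to, based on the first timestamp of an ID with the same Type, remove the rows within 10 minutes", I believe you can use groupby().transform() to identify the first timestamp and then apply a boolean mask: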

    # use transform('min') instead if the rows are not already sorted by time
    first_timestamps = df.groupby(['ID','Type'])['DatetimeX'].transform('first')
    
    mask = df['DatetimeX'] - first_timestamps < pd.Timedelta('10Min')
    
    df[mask]
    

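    However, since your sample data are all within 10 minutes of each other, this does not filter anything out.

    If instead we change 10Min in the second line above to 1S, we get the expected output: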

                DatetimeX     ID Type
    0 2020-01-01 02:00:01  12345    C
    4 2020-01-01 02:00:01  13333    D
    6 2020-01-01 02:00:50  13333    E
    7 2020-01-01 02:00:01  12211    C
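As a variation on the mask above, the sketch below keeps each group's earliest row and drops only the rows that follow it within 10 minutes. The data and the names `elapsed`, `is_first` and `within_window` are illustrative. Note the limitation: the anchor is fixed at the group's first timestamp, so a later cluster such as 05:00:03 and 05:03:05 is kept whole, whereas the expected output in the question drops 05:03:05. Matching that output needs an anchor that moves forward with every kept row.

```python
import pandas as pd

df = pd.DataFrame({
    "DatetimeX": pd.to_datetime([
        "2020-01-01 02:00:01", "2020-01-01 02:00:03", "2020-01-01 02:10:35",
        "2020-01-01 05:00:03", "2020-01-01 05:03:05",
    ]),
    "ID": [12345] * 5,
    "Type": ["C"] * 5,
})

# earliest timestamp of each (ID, Type) group, aligned to every row
first_ts = df.groupby(["ID", "Type"])["DatetimeX"].transform("min")
elapsed = df["DatetimeX"] - first_ts

is_first = elapsed == pd.Timedelta(0)
within_window = elapsed < pd.Timedelta(minutes=10)

# keep the first row, plus everything at least 10 minutes after it
kept = df[is_first | ~within_window]
print(kept.index.tolist())  # [0, 2, 3, 4]
```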
    

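    【Discussion】:

    • Thanks for the help. I tried the code as shown above but ran into some problems. Since I have a dataset of up to 3000 rows, I hit two scenarios when testing the code. E.g. 1: if I have 2 records within one day, both get removed. E.g. 2: I still have records within 10 minutes of each other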