【Question title】: Remove rows with pandas based on conditions
【Posted】: 2021-02-28 00:10:58
【Question】:

I created some data that looks like this:

import pandas as pd
d = {'Time': ['01.10.2019, 09:56:52', '01.10.2019, 09:57:15', '02.10.2019 09:57:23', '02.10.2019 10:02:58', '02.10.2019 13:11:58', '02.10.2019 13:22:55', '03.10.2019, 09:56:52', '03.10.2019, 09:57:15', '04.10.2019 09:57:23', '04.10.2019 10:02:58', '04.10.2019 13:11:58', '04.10.2019 13:22:55']
     ,'Action': ['Opened', 'Closed', 'Opened', 'Closed', 'Opened', 'Closed', 'Opened', 'Closed', 'Opened', 'Closed', 'Opened', 'Closed']
     ,'Name': ['Bayer', 'Bayer', 'ITM', 'ITM', 'ITM' , 'ITM', 'ITM', 'ITM', 'Treso', 'Treso', 'Geco' , 'Geco']}
df = pd.DataFrame(data=d)

     Time                    Action    Name
0    01.10.2019, 09:56:52    Opened    Bayer
1    01.10.2019, 09:57:15    Closed    Bayer
2    02.10.2019, 09:57:23    Opened    ITM
3    02.10.2019, 10:03:58    Closed    ITM
4    02.10.2019, 13:11:58    Opened    ITM
5    02.10.2019, 13:22:55    Closed    ITM
6    03.10.2019, 09:56:52    Opened    ITM
7    03.10.2019, 09:57:15    Closed    ITM
8    04.10.2019, 09:57:23    Opened    Treso
9    04.10.2019, 10:03:58    Closed    Treso
10    04.10.2019, 13:11:58    Opened    Geco
11    04.10.2019, 13:22:55    Closed    Geco

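Now I want to delete rows according to these conditions:

  • If the time between Opened and Closed is less than 5 minutes and the name is the same, the rows should be deleted.
  • If there is an Opened action with the same name that is repeated after a Closed row on the same day, everything with that name between the first Opened and the last Closed should be deleted. For example, the rows between row 2 and row 5 should be deleted, but the range should not extend to row 7, because that is a day later.

An example for the second condition: given this input: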

     Time                    Action    Name
0    02.10.2019, 09:57:23    Opened    ITM
1    02.10.2019, 10:03:58    Closed    ITM
2    02.10.2019, 13:11:58    Opened    ITM
3    02.10.2019, 13:22:55    Closed    ITM
4    03.10.2019, 09:56:52    Opened    ITM
5    03.10.2019, 09:57:15    Closed    ITM

my output should look like this:

0    02.10.2019, 13:11:58    Opened    ITM
1    02.10.2019, 13:22:55    Closed    ITM
2    03.10.2019, 09:56:52    Opened    ITM
3    03.10.2019, 09:57:15    Closed    ITM

Because it is the next day, from 2 October to 3 October, and the other interval is less than 5 minutes.

But if we have this case:

0    02.10.2019, 09:57:23    Opened    ITM
1    02.10.2019, 10:03:58    Closed    ITM
2    02.10.2019, 13:11:58    Opened    ITM
3    02.10.2019, 13:22:55    Closed    ITM
4    02.10.2019, 09:56:52    Opened    ITM
5    02.10.2019, 09:57:15    Closed    ITM

all rows except rows 2 and 3 should be deleted:

2    02.10.2019, 13:11:58    Opened    ITM
3    02.10.2019, 13:22:55    Closed    ITM

My desired output should look like this:

     Time                    Action    Name
0    02.10.2019, 09:57:23    Opened    ITM
3    02.10.2019, 13:22:55    Closed    ITM
4    03.10.2019, 09:56:52    Opened    ITM
5    03.10.2019, 09:57:15    Closed    ITM
6    04.10.2019, 09:57:23    Opened    Treso
7    04.10.2019, 10:03:58    Closed    Treso
8    04.10.2019, 13:11:58    Opened    Geco
9    04.10.2019, 13:22:55    Closed    Geco

What I tried:

df_new = (
    df.assign(group=pd.to_datetime(df["Time"]).diff().dt.seconds.gt(300).cumsum())
      .groupby(["group", "Time", "Action", "Name"])
      .first()
)

Can someone help me?

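【Comments】:

  • Are Opened and Closed always consecutive?
  • Yes, every Opened should be followed by a Closed, so they should be consecutive.
  • It looks like shift() can handle this.
  • Thanks for the comment. How would I add the second condition then? :)
  • Could you have something like an open at 23:59 and a close at 00:04?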
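
The two hints in these comments (using shift(), and a session that might cross midnight) can be illustrated with a minimal sketch. It assumes that Opened and Closed strictly alternate per name, as confirmed above; the toy data and the column names `duration` and `session_date` are hypothetical and not part of the question.

```python
import pandas as pd

# Hypothetical toy data: the first session crosses midnight.
df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2019-10-01 23:59:00", "2019-10-02 00:04:30",
        "2019-10-02 09:57:23", "2019-10-02 10:03:58",
    ]),
    "Action": ["Opened", "Closed", "Opened", "Closed"],
    "Name": ["ITM", "ITM", "ITM", "ITM"],
})

df = df.sort_values(["Name", "Time"])

# Previous timestamp within the same name; NaT on the first row of each name.
prev_time = df.groupby("Name")["Time"].shift(1)

# The duration is only meaningful on Closed rows (time since the matching Opened).
df["duration"] = (df["Time"] - prev_time).where(df["Action"] == "Closed")

# Give every row the date of its Opened row, so that a pair which
# crosses midnight still counts as one session on one day.
opened_time = df["Time"].where(df["Action"] == "Opened")
df["session_date"] = opened_time.groupby(df["Name"]).ffill().dt.date

print(df)
```

Grouping by `session_date` instead of the calendar date of each row would keep the 23:59 / 00:04 pair together; whether that is the desired behaviour depends on the answer to the last comment.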

Tags: python pandas dataframe csv


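【Solution 1】:

Assuming your logic requires:

  1. Eliminate everything less than 5 minutes apart.
  2. From the REMAINING values, remove names that were opened more than once in a day:

Based on your data: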

import numpy as np
import pandas as pd
d = {'Time': ['01.10.2019, 09:56:52', '01.10.2019, 09:57:15', '02.10.2019 09:57:23', '02.10.2019 10:02:58', '02.10.2019 13:11:58', '02.10.2019 13:22:55', '03.10.2019, 09:56:52', '03.10.2019, 09:57:15', '04.10.2019 09:57:23', '04.10.2019 10:02:58', '04.10.2019 13:11:58', '04.10.2019 13:22:55']
     ,'Action': ['Opened', 'Closed', 'Opened', 'Closed', 'Opened', 'Closed', 'Opened', 'Closed', 'Opened', 'Closed', 'Opened', 'Closed']
     ,'Name': ['Bayer', 'Bayer', 'ITM', 'ITM', 'ITM' , 'ITM', 'ITM', 'ITM', 'Treso', 'Treso', 'Geco' , 'Geco']}
df = pd.DataFrame(data=d)

First do some conversion and sorting to make sure the data is in the right order.

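## convert time to datetime
## (the sample strings mix '01.10.2019, 09:56:52' and '02.10.2019 09:57:23',
## so drop the comma and parse both with one explicit day-first format)
df['Time'] = pd.to_datetime(df['Time'].str.replace(',', '', regex=False), format='%d.%m.%Y %H:%M:%S')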

# get date
df['Date'] = df['Time'].dt.date

## make sure it's sorted by Name, and time:
df = df.sort_values(['Name', 'Time'])

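## get time till next action for same client
## (only filled on 'Opened' rows followed by a row of the same name; zero elsewhere)
df['time_to_next_action'] = np.where(
    (df['Name'] == df['Name'].shift(-1)) & (df['Action'] == 'Opened'),
    df['Time'].shift(-1) - df['Time'],
    np.timedelta64(0, 'ns'),
)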

## convert time difference to minutes
df['time_to_next_action'] = df['time_to_next_action'].dt.total_seconds()/60.0

# CONDITION 1:

# delete entries under 5 minutes:

too_short = (
    ((df['Action'] == 'Opened') & (df['time_to_next_action'] < 5))
    | ((df['Action'] == 'Closed') & (df['time_to_next_action'].shift(1) < 5))
)
df = df[~too_short]

### explanation:  
## two possibilities: 
# 1. if action is Open and time to next action is less than 5 minutes delete it 
# 2. if action is 'Close' and time delta from previous action is less than 5 minutes, delete it

### CONDITION 2 edited based on comments:

## keep only first 'Opened'
df_first_open = df[df['Action']=='Opened'].sort_values(['Name', 'Time']).drop_duplicates(subset=['Name', 'Date', 'Action'])

## keep only last 'Closed'
df_last_close = df[df['Action']=='Closed'].sort_values(['Name', 'Time'], ascending = False).drop_duplicates(subset=['Name', 'Date', 'Action'])

## combine and sort the two
df = pd.concat([df_first_open, df_last_close]).sort_values(['Name', 'Time'])

# OPTIONAL: you can drop the extra columns:
df = df.drop(columns=['Date', 'time_to_next_action'])

print(df)

New output:

    Time                 Action  Name
10  2019-10-04 13:11:58  Opened  Geco
11  2019-10-04 13:22:55  Closed  Geco
2   2019-10-02 09:57:23  Opened  ITM
5   2019-10-02 13:22:55  Closed  ITM
8   2019-10-04 09:57:23  Opened  Treso
9   2019-10-04 10:02:58  Closed  Treso

For the record, the original condition 2 was:

## get date/name combinations that had only 1 'Open' per day:
df_to_keep = df[df['Action']=='Opened'].groupby(['Name', 'Action', 'Date']).count().reset_index()
df_to_keep = df_to_keep[np.where(df_to_keep['Time']==1, True, False)]

# those are the ones you'll keep in final output:
df = pd.merge(df_to_keep[['Name', 'Date']], df, how='left', on=['Name', 'Date'])

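【Comments】:

  • Thanks, there is just one error! Pairs under 5 minutes should only be deleted when there are just two values (an Opened and a Closed). If there are more than two values, take the first Opened and the last Closed and delete everything in between.
  • Ah... that's a different matter. I edited the answer.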
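
The rule as restated in the first comment can also be read literally, per name and per day. The following is a minimal sketch of that reading, not the code from the answer: the helper `keep_labels` and the `Date` column are hypothetical names, the timestamps are the question's sample data with the stray commas removed, and leaving incomplete pairs untouched is an assumption.

```python
import pandas as pd

# Sample data from the question, with the commas removed from the timestamps.
df = pd.DataFrame({
    "Time": [
        "01.10.2019 09:56:52", "01.10.2019 09:57:15",
        "02.10.2019 09:57:23", "02.10.2019 10:02:58",
        "02.10.2019 13:11:58", "02.10.2019 13:22:55",
        "03.10.2019 09:56:52", "03.10.2019 09:57:15",
        "04.10.2019 09:57:23", "04.10.2019 10:02:58",
        "04.10.2019 13:11:58", "04.10.2019 13:22:55",
    ],
    "Action": ["Opened", "Closed"] * 6,
    "Name": ["Bayer"] * 2 + ["ITM"] * 6 + ["Treso"] * 2 + ["Geco"] * 2,
})
df["Time"] = pd.to_datetime(df["Time"], format="%d.%m.%Y %H:%M:%S")
df["Date"] = df["Time"].dt.date


def keep_labels(group):
    """Return the index labels to keep for one name on one day."""
    group = group.sort_values("Time")
    opened = group[group["Action"] == "Opened"].index
    closed = group[group["Action"] == "Closed"].index
    if len(opened) == 0 or len(closed) == 0:
        # Incomplete pair: left untouched (an assumption).
        return list(group.index)
    if len(group) == 2:
        # Exactly one pair: drop it when it lasted less than 5 minutes.
        duration = group.loc[closed[0], "Time"] - group.loc[opened[0], "Time"]
        return [] if duration < pd.Timedelta(minutes=5) else list(group.index)
    # More than two rows: keep only the first Opened and the last Closed.
    return [opened[0], closed[-1]]


keep = []
for _, group in df.groupby(["Name", "Date"], sort=False):
    keep.extend(keep_labels(group))

result = df.loc[sorted(keep)].drop(columns="Date")
print(result)
```

On the sample data this keeps rows 2, 5, 8, 9, 10 and 11, the same rows as the new output in the answer. The two approaches can differ when a day contains a short pair followed by a longer one, because the answer removes short pairs before it picks the first Opened and the last Closed.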