【Question Title】: Pandas groupby with multiple conditions and date difference calculation
【Posted】: 2021-02-18 02:35:16
【Question】:

I can't work out which approach to use. I have the following DataFrame:

import pandas as pd

df = {'CODE': ['BBLGLC70M','BBLGLC70M','ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
      'DATE': ['16/05/2019','25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
      'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
      'DESC' : ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']
       }

df = pd.DataFrame(df)
df['DATE'] = pd.to_datetime(df['DATE'], format = '%d/%m/%Y')
df

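I need to:

  1. Group by identical 'CODE',
  2. Check that 'DESC' is not the same
  3. Check that 'TYPE' is the same
  4. Calculate the difference in months between the dates that satisfy the previous two conditions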

The expected output is as follows:

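【Comments】:

  • Hi, what have you tried yourself?
  • Hi, I tried creating a pivot table with CODE, TYPE and DATE as the index and DESC as the values, aggregated with size(). Then I used df.groupby(level=0)['DATE'].transform(lambda x: x[0] - x[1]), and this is where I go wrong...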

Tags: pandas datetime pandas-groupby pivot-table hierarchical-clustering


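【Solution 1】:

The following code uses .drop_duplicates() and .duplicated() to keep or discard rows of the DataFrame that have duplicate values.

How would you calculate a difference in months? A month can have anywhere from 28 to 31 days. You could divide the final result by 30 to get a rough indication of the number of months, so I have kept the difference in days for now.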
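If a month figure is needed after all, two common conventions can be sketched as follows. This is only an illustration, not part of the code below: the column names approx_months and calendar_months are made up here, and the three date pairs are taken from the question's data. The first convention divides the day count by 30 as suggested above; the second counts calendar-month boundaries crossed and ignores the day of the month.

```python
import pandas as pd

# Three date pairs taken from the question's data (illustrative subset).
dates = pd.DataFrame({
    "previous_date": pd.to_datetime(["2019-05-16", "2020-07-16", "2020-02-13"]),
    "DATE": pd.to_datetime(["2019-09-25", "2020-07-21", "2020-02-27"]),
})

# Difference in whole days: 132, 5 and 14.
days = (dates["DATE"] - dates["previous_date"]).dt.days

# Convention 1 (assumption: a month is treated as 30 days): 4.4, 0.17, 0.47.
dates["approx_months"] = (days / 30).round(2)

# Convention 2 (assumption: count calendar-month boundaries crossed,
# ignoring the day of the month): 4, 0, 0.
dates["calendar_months"] = (
    (dates["DATE"].dt.year - dates["previous_date"].dt.year) * 12
    + (dates["DATE"].dt.month - dates["previous_date"].dt.month)
)

print(dates)
```

Which convention is appropriate depends on what "month" should mean for the data, so neither should be read as the definitive answer.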

import pandas as pd

df = {'CODE': ['BBLGLC70M','BBLGLC70M','ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
      'DATE': ['16/05/2019','25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
      'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
      'DESC' : ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']
       }

df = pd.DataFrame(df)
df['DATE'] = pd.to_datetime(df['DATE'], format = '%d/%m/%Y')

# only keep rows that have the same code and type 
df = df[df.duplicated(subset=['CODE', 'TYPE'], keep=False)]

# throw out rows that have the same code and desc
df = df.drop_duplicates(subset=['CODE', 'DESC'], keep=False)

# find previous date
df = df.sort_values(by=['CODE', 'DATE'])
df['previous_date'] = df.groupby('CODE')['DATE'].transform('shift')

# drop rows that don't have a previous date
df = df.dropna(subset=['previous_date'])

# calculate the difference between current date and previous date
df['difference_in_dates'] = (df['DATE'] - df['previous_date'])

This produces the following DataFrame:

CODE        DATE        TYPE    DESC    previous_date   difference_in_dates
AACCBD      2020-07-21  PUB     OK      2020-07-16      5 days
BBLGLC70M   2019-09-25  PRI     OK      2019-05-16      132 days
BCCDN       2020-02-27  PUB     OK      2020-02-13      14 days
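The same result can also be reached with groupby, which is closer to how the question is worded. The following is a sketch of that alternative, not the method used above. It assumes that "DESC is not the same" means a CODE/TYPE group contains at least two distinct DESC values; the names keys, has_mixed_desc and days_since_previous are made up for illustration. For groups with more than two rows it may keep rows that the drop_duplicates approach would discard.

```python
import pandas as pd

data = {
    "CODE": ["BBLGLC70M", "BBLGLC70M", "ZZTNRD77", "ZZTNRD77", "AACCBD",
             "AACCBD", "BCCDN", "BCCDN", "BCCDN"],
    "DATE": ["16/05/2019", "25/09/2019", "16/03/2020", "27/02/2020", "16/07/2020",
             "21/07/2020", "13/02/2020", "23/07/2020", "27/02/2020"],
    "TYPE": ["PRI", "PRI", "PRI", "PRI", "PUB", "PUB", "PUB", "PRI", "PUB"],
    "DESC": ["KO", "OK", "KO", "KO", "KO", "OK", "KO", "OK", "OK"],
}
df = pd.DataFrame(data)
df["DATE"] = pd.to_datetime(df["DATE"], format="%d/%m/%Y")

# Grouping on CODE and TYPE together enforces "same CODE, same TYPE".
keys = ["CODE", "TYPE"]
df = df.sort_values(keys + ["DATE"])

# Keep only groups whose DESC values are not all identical.
has_mixed_desc = df.groupby(keys)["DESC"].transform("nunique") > 1
pairs = df[has_mixed_desc].copy()

# Days elapsed since the previous row of the same group (NaN for the first row).
pairs["days_since_previous"] = pairs.groupby(keys)["DATE"].diff().dt.days
result = pairs.dropna(subset=["days_since_previous"]).reset_index(drop=True)

print(result)
```

On the question's data this leaves one row each for AACCBD, BBLGLC70M and BCCDN, with 5, 132 and 14 days respectively, matching the table above.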
