【Question title】:Merge and match with conditions in Pandas
【Posted】:2021-10-24 22:58:01
【Question】:

I have two simple dataframes. I want to merge the two where special_date >= first_date and

    ID   |  special_date | 
0   11   |   2019-04-06  |  
1   11   |   2019-04-09  |  
2   11   |   2019-06-03  |  
3   11   |   2019-03-11  |  

    ID   |   first_date  |  second_date |
0   11   |   2019-04-03  |  2019-04-09  |
1   11   |   2019-05-02  |  2019-05-14  |
2   11   |   2019-05-20  |  2019-06-05  |
3   11   |   2019-03-03  |  2019-03-07  |

Expected output:

    ID   |   first_date  | special_date |  second_date |
0   11   |   2019-04-03  |  2019-04-09  |  2019-04-09  |
1   11   |   2019-05-02  |      NaN     |  2019-05-14  |
2   11   |   2019-05-20  |  2019-06-03  |  2019-06-05  |
3   11   |   2019-03-03  |      NaN     |  2019-03-07  |
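
To run the answers, the two frames from the question can be rebuilt as in the minimal sketch below. The names `df1` and `df2` are an assumption taken from the answers; the question itself does not name the frames.

```python
import pandas as pd

# df1 holds the candidate dates, df2 holds the date windows
# (frame names follow the answers; the question does not name them).
df1 = pd.DataFrame({
    "ID": [11, 11, 11, 11],
    "special_date": ["2019-04-06", "2019-04-09", "2019-06-03", "2019-03-11"],
})
df2 = pd.DataFrame({
    "ID": [11, 11, 11, 11],
    "first_date": ["2019-04-03", "2019-05-02", "2019-05-20", "2019-03-03"],
    "second_date": ["2019-04-09", "2019-05-14", "2019-06-05", "2019-03-07"],
})

# Parse the strings so that comparisons are chronological, not lexical.
df1["special_date"] = pd.to_datetime(df1["special_date"])
for col in ("first_date", "second_date"):
    df2[col] = pd.to_datetime(df2[col])

print(df1.dtypes)
print(df2.dtypes)
```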

【Comments】:

  • Let us know whether your problem has been solved. Do you need any clarification on the answers below?

Tags: python pandas dataframe merge conditional-statements


【Solution 1】:

Try:

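import pandas as pd

# Convert the columns to datetime if necessary:
df1["special_date"] = pd.to_datetime(df1["special_date"])
df2["first_date"] = pd.to_datetime(df2["first_date"])
df2["second_date"] = pd.to_datetime(df2["second_date"])

# For each window, list every day from first_date to second_date (inclusive):
df2["tmp"] = df2.apply(
    lambda x: pd.date_range(x["first_date"], x["second_date"]), axis=1
)

# Expand to one row per day, then match each day against special_date:
df2 = (
    df2.explode("tmp")
    .merge(
        df1, left_on=["ID", "tmp"], right_on=["ID", "special_date"], how="outer"
    )
    .drop(columns="tmp")
)

# Keep the latest special_date per window (NaT when nothing matched):
print(df2.groupby(["ID", "first_date", "second_date"], as_index=False).max())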

Prints:

   ID first_date second_date special_date
0  11 2019-03-03  2019-03-07          NaT
1  11 2019-04-03  2019-04-09   2019-04-09
2  11 2019-05-02  2019-05-14          NaT
3  11 2019-05-20  2019-06-05   2019-06-03
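
To see why the explode-based merge amounts to a range check, it may help to look at a single window in isolation. The sketch below uses one window and two dates from the question. It assumes the dates carry no time-of-day component, since `pd.date_range` steps in whole days here. The names `window`, `days`, and `matches` are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [11, 11],
    "special_date": pd.to_datetime(["2019-04-06", "2019-04-09"]),
})
window = pd.DataFrame({
    "ID": [11],
    "first_date": pd.to_datetime(["2019-04-03"]),
    "second_date": pd.to_datetime(["2019-04-09"]),
})

# One daily DatetimeIndex per window, both endpoints included.
window["tmp"] = [
    pd.date_range(start, end)
    for start, end in zip(window["first_date"], window["second_date"])
]

# explode() turns the single window into one row per calendar day.
days = window.explode("tmp")
# Make the dtype explicit so both merge keys are datetime columns.
days["tmp"] = pd.to_datetime(days["tmp"])
print(len(days))  # 7 rows: 2019-04-03 through 2019-04-09

# An equality merge on the expanded days keeps exactly the dates inside the window.
matches = days.merge(
    df1, left_on=["ID", "tmp"], right_on=["ID", "special_date"], how="inner"
)
print(matches[["first_date", "second_date", "special_date"]])
```

Because every window is expanded to one row per day, the intermediate frame grows with the window length, which is worth keeping in mind for long date ranges.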

    【Solution 2】:

    You can use the following approach.

    First, convert the dates if they are not already in datetime format:

    df1['special_date'] = pd.to_datetime(df1['special_date'])
    
    df2['first_date'] = pd.to_datetime(df2['first_date'])
    df2['second_date'] = pd.to_datetime(df2['second_date'])
    

    Then use .merge() + .query() + .groupby() + .max() to merge, filter, group, and pick the latest matching date, as follows:

    df_out = (df1.merge(df2, on='ID', how='right')
                 .query('(special_date >= first_date) & (special_date <= second_date)')
                 .groupby(['ID', 'first_date', 'second_date'], as_index=False)['special_date'].max()
                 .merge(df2, on=['ID', 'first_date', 'second_date'], how='right')
             )
    

    Result:

    print(df_out)
    
       ID first_date second_date special_date
    0  11 2019-04-03  2019-04-09   2019-04-09
    1  11 2019-05-02  2019-05-14          NaT
    2  11 2019-05-20  2019-06-05   2019-06-03
    3  11 2019-03-03  2019-03-07          NaT
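
The same merge, filter, group, and re-attach idea can be written out step by step on the question's data. This is a sketch rather than the answer's exact code: it swaps `.query()` for `Series.between`, which is inclusive on both ends by default, and the intermediate names (`pairs`, `inside`, `latest`) are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [11, 11, 11, 11],
    "special_date": pd.to_datetime(
        ["2019-04-06", "2019-04-09", "2019-06-03", "2019-03-11"]
    ),
})
df2 = pd.DataFrame({
    "ID": [11, 11, 11, 11],
    "first_date": pd.to_datetime(
        ["2019-04-03", "2019-05-02", "2019-05-20", "2019-03-03"]
    ),
    "second_date": pd.to_datetime(
        ["2019-04-09", "2019-05-14", "2019-06-05", "2019-03-07"]
    ),
})

# Step 1: pair every window with every special_date sharing its ID (4 x 4 = 16 rows).
pairs = df2.merge(df1, on="ID", how="left")

# Step 2: keep the pairs whose special_date lies inside the window (inclusive).
inside = pairs[
    pairs["special_date"].between(pairs["first_date"], pairs["second_date"])
]

# Step 3: latest matching date per window.
keys = ["ID", "first_date", "second_date"]
latest = inside.groupby(keys, as_index=False)["special_date"].max()

# Step 4: re-attach the windows without a match so they appear with NaT.
df_out = df2.merge(latest, on=keys, how="left")
print(df_out)
```

The merge on `ID` alone produces one row for every (window, date) pair per ID, so the intermediate frame can become large when an ID has many rows on both sides.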
    

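      【Solution 3】:

      Here is an approach that merges first and cleans up afterwards: use merge_asof to match each 'special_date' to the nearest preceding 'first_date', then check that the value is below 'second_date'.

      Prerequisites: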

      df1['special_date'] = pd.to_datetime(df1['special_date'])
      df2['first_date'] = pd.to_datetime(df2['first_date'])
      df2['second_date'] = pd.to_datetime(df2['second_date'])
      

      Processing:

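      # merge_asof requires both frames to be sorted by their merge keys
      df3 = pd.merge_asof(df1.sort_values(by='special_date'),
                          df2.sort_values(by='first_date'),
                          left_on='special_date',
                          right_on='first_date',
                          suffixes=['', '_drop']
                         ).drop(columns='ID_drop')
      # blank out dates past the window; the strict '<' also drops special_date == second_date
      df3['special_date'] = df3['special_date'].where(df3['special_date']<df3['second_date'])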
      

      Output:

         ID special_date first_date second_date
      0  11          NaT 2019-03-03  2019-03-07
      1  11   2019-04-06 2019-04-03  2019-04-09
      2  11          NaT 2019-04-03  2019-04-09
      3  11   2019-06-03 2019-05-20  2019-06-05
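
The output above has one row per `special_date` and treats the upper bound as exclusive, so it differs from the expected output in the question. The sketch below is one possible variation, not the answer's code: it passes `by="ID"` so rows are only matched within the same ID, uses an inclusive `<=`, and then folds the result back to one row per window. It assumes the windows for a given ID do not overlap, because `merge_asof` only looks at the nearest preceding `first_date`.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [11, 11, 11, 11],
    "special_date": pd.to_datetime(
        ["2019-04-06", "2019-04-09", "2019-06-03", "2019-03-11"]
    ),
})
df2 = pd.DataFrame({
    "ID": [11, 11, 11, 11],
    "first_date": pd.to_datetime(
        ["2019-04-03", "2019-05-02", "2019-05-20", "2019-03-03"]
    ),
    "second_date": pd.to_datetime(
        ["2019-04-09", "2019-05-14", "2019-06-05", "2019-03-07"]
    ),
})

# merge_asof needs both frames sorted by their "on" keys.
matched = pd.merge_asof(
    df1.sort_values("special_date"),
    df2.sort_values("first_date"),
    left_on="special_date",
    right_on="first_date",
    by="ID",  # match within the same ID; no duplicated ID column to drop
)

# Inclusive upper bound, to agree with the expected output in the question.
matched = matched[matched["special_date"] <= matched["second_date"]]

# Latest matching date per window, then re-attach windows without a match.
keys = ["ID", "first_date", "second_date"]
latest = matched.groupby(keys, as_index=False)["special_date"].max()
df_out = df2.merge(latest, on=keys, how="left")
print(df_out)
```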
      
