【Question title】:How to do an outer merge of two dataframes where column values fall within a certain range?
【Posted】:2021-12-02 11:46:25
【Question】:

This is a follow-up to this question.

I have two dataframes:

print(df_1)

  timestamp      A          B
0 2016-05-15     0.020228   0.026572
1 2016-05-15     0.057780   0.175499
2 2016-05-15     0.098808   0.620986
3 2016-05-17     0.158789   1.014819
4 2016-05-17     0.038129   2.384590
5 2018-05-17     0.011111   9.999999


print(df_2)

  start                end  event    
0 2016-05-14   2016-05-16   E1
1 2016-05-14   2016-05-16   E2
2 2016-05-17   2016-05-18   E3

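I want to merge df_1 and df_2 so that df_1 gains the event column wherever timestamp falls between start and end.

The complications, and the differences from this question:

1) Events E1 and E2 have the same start and end.

2) In df_1, the 6th row does not fall into any interval.

In the end I want to keep both events, and have NA for rows that match no event.

So the dataframe I expect to get looks like this: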

  timestamp      A          B         event
0 2016-05-15     0.020228   0.026572  E1
1 2016-05-15     0.057780   0.175499  E1
2 2016-05-15     0.098808   0.620986  E1
3 2016-05-15     0.020228   0.026572  E2 
4 2016-05-15     0.057780   0.175499  E2
5 2016-05-15     0.098808   0.620986  E2
6 2016-05-17     0.158789   1.014819  E3
7 2016-05-17     0.038129   2.384590  E3
8 2018-05-17     0.011111   9.999999  NA

【Comments】:

    Tags: python python-3.x pandas


    【Solution 1】:
    import pandas as pd
    
    df_1 = pd.DataFrame({'timestamp':['2016-05-15','2016-05-15','2016-05-15','2016-05-17','2016-05-17','2018-05-17'],
                         'A':[1,1,1,1,1,1]})
    df_2 = pd.DataFrame({'start':['2016-05-14','2016-05-14','2016-05-17'],
                         'end':['2016-05-16','2016-05-16','2016-05-18'],
                         'event':['E1','E2','E3']})
    df_1.timestamp = pd.to_datetime(df_1.timestamp, format='%Y-%m-%d')
    df_2.start = pd.to_datetime(df_2.start, format='%Y-%m-%d')
    df_2.end = pd.to_datetime(df_2.end, format='%Y-%m-%d')
    
    # convert df_2 to long format with one row per date inside each interval, then do a left merge on the date
    df_2_2 = pd.melt(df_2, id_vars='event', value_name='timestamp')
    df_2_2.timestamp = pd.to_datetime(df_2_2.timestamp)
    df_2_2.set_index('timestamp', inplace=True)
    df_2_2.drop('variable', axis=1, inplace=True)
    
    df_2_3 = df_2_2.groupby('event').resample('D').ffill().reset_index(level=0, drop=True).reset_index()
    
    df_2 = pd.merge(df_2, df_2_3)
    df_2 = df_2.drop(columns=['start', 'end'])
    
    df_1 = df_1.merge(df_2,on='timestamp',  how='left')
    
    print(df_1)
       timestamp  A event
    0 2016-05-15  1    E1
    1 2016-05-15  1    E2
    2 2016-05-15  1    E1
    3 2016-05-15  1    E2
    4 2016-05-15  1    E1
    5 2016-05-15  1    E2
    6 2016-05-17  1    E3
    7 2016-05-17  1    E3
    8 2018-05-17  1   NaN
    

    Thanks to this answer.

    There is also the following solution, but it does not give NA for the last row:
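
The "one row per date in each interval" idea can also be sketched without `melt` and `resample`, by expanding every interval with `pd.date_range`. This is only an illustrative variant, not the answer's own code: the names `rows`, `days` and `out` are made up here, and it assumes, like the approach above, that all timestamps are whole days.

```python
import pandas as pd

df_1 = pd.DataFrame({
    'timestamp': pd.to_datetime(['2016-05-15', '2016-05-15', '2016-05-15',
                                 '2016-05-17', '2016-05-17', '2018-05-17']),
    'A': [1, 1, 1, 1, 1, 1],
})
df_2 = pd.DataFrame({
    'start': pd.to_datetime(['2016-05-14', '2016-05-14', '2016-05-17']),
    'end': pd.to_datetime(['2016-05-16', '2016-05-16', '2016-05-18']),
    'event': ['E1', 'E2', 'E3'],
})

# Build one (timestamp, event) row for every day covered by each interval.
rows = [
    {'timestamp': day, 'event': event}
    for start, end, event in zip(df_2['start'], df_2['end'], df_2['event'])
    for day in pd.date_range(start, end, freq='D')
]
days = pd.DataFrame(rows)

# A left merge keeps every row of df_1; rows with no matching day get NaN.
out = df_1.merge(days, on='timestamp', how='left')
print(out)
```

With the sample data this should give nine rows, with a missing event for 2018-05-17. The expansion grows with the length of the intervals, so it is best suited to short ranges.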

    import pandas as pd
    
    df_1 = pd.DataFrame({'timestamp':['2016-05-15','2016-05-15','2016-05-15','2016-05-17','2016-05-17','2018-05-17'],
                         'A':[1,1,1,1,1,1]})
    df_2 = pd.DataFrame({'start':['2016-05-14','2016-05-14','2016-05-17'],
                         'end':['2016-05-16','2016-05-16','2016-05-18'],
                         'event':['E1','E2','E3']})   
    
    df_try2 = pd.merge(df_1.assign(key=1), df_2.assign(key=1), on='key').query('timestamp >= start and timestamp <= end')    
    
    print(df_try2)
    
       timestamp  A  key      start        end event
    0  2016-05-15  1    1 2016-05-14 2016-05-16    E1
    1  2016-05-15  1    1 2016-05-14 2016-05-16    E2
    3  2016-05-15  1    1 2016-05-14 2016-05-16    E1
    4  2016-05-15  1    1 2016-05-14 2016-05-16    E2
    6  2016-05-15  1    1 2016-05-14 2016-05-16    E1
    7  2016-05-15  1    1 2016-05-14 2016-05-16    E2
    11 2016-05-17  1    1 2016-05-17 2016-05-18    E3
    14 2016-05-17  1    1 2016-05-17 2016-05-18    E3
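
The cross-join result above loses the 2018-05-17 row because the filter discards every pair that is out of range. One possible way to get it back, sketched here under the assumption that a temporary helper column (called `row` purely for illustration) is acceptable, is to left-merge the filtered pairs onto the original frame:

```python
import pandas as pd

df_1 = pd.DataFrame({
    'timestamp': pd.to_datetime(['2016-05-15', '2016-05-15', '2016-05-15',
                                 '2016-05-17', '2016-05-17', '2018-05-17']),
    'A': [1, 1, 1, 1, 1, 1],
})
df_2 = pd.DataFrame({
    'start': pd.to_datetime(['2016-05-14', '2016-05-14', '2016-05-17']),
    'end': pd.to_datetime(['2016-05-16', '2016-05-16', '2016-05-18']),
    'event': ['E1', 'E2', 'E3'],
})

# Tag each row of df_1 with a positional id (hypothetical helper column).
left = df_1.assign(row=range(len(df_1)))

# Cross join through a constant key, as in the snippet above.
pairs = left.assign(key=1).merge(df_2.assign(key=1), on='key')

# Keep only the pairs whose timestamp lies inside [start, end].
in_range = pairs['timestamp'].between(pairs['start'], pairs['end'])
matched = pairs.loc[in_range, ['row', 'event']]

# Left-merge back so rows without any match survive with NaN.
out = left.merge(matched, on='row', how='left').drop(columns='row')
print(out)
```

The cross join still materializes len(df_1) * len(df_2) rows, so this sketch is only practical for moderately sized inputs.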
    

    【Comments】:

      【Solution 2】:

      One option is conditional_join from pyjanitor, which abstracts away inequality joins:

      # pip install pyjanitor
      import pandas as pd
      import janitor
      
      (df_1.conditional_join(
               df_2, 
               ('timestamp', 'start', '>='), 
               ('timestamp', 'end', '<='), 
               how = 'left')
           .loc[:, ['timestamp', 'A', 'B', 'event']]
      )
         timestamp         A         B event
      0 2016-05-15  0.020228  0.026572    E1
      1 2016-05-15  0.020228  0.026572    E2
      2 2016-05-15  0.057780  0.175499    E1
      3 2016-05-15  0.057780  0.175499    E2
      4 2016-05-15  0.098808  0.620986    E1
      5 2016-05-15  0.098808  0.620986    E2
      6 2016-05-17  0.158789  1.014819    E3
      7 2016-05-17  0.038129  2.384590    E3
      8 2018-05-17  0.011111  9.999999   NaN
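
If installing pyjanitor is not an option, the same inequality join can be approximated with NumPy broadcasting. The following is a rough sketch rather than an equivalent of `conditional_join`: the names `mask`, `matched` and the helper column `row` are illustrative only, and the boolean mask has len(df_1) * len(df_2) entries.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({
    'timestamp': pd.to_datetime(['2016-05-15', '2016-05-15', '2016-05-15',
                                 '2016-05-17', '2016-05-17', '2018-05-17']),
    'A': [0.020228, 0.057780, 0.098808, 0.158789, 0.038129, 0.011111],
    'B': [0.026572, 0.175499, 0.620986, 1.014819, 2.384590, 9.999999],
})
df_2 = pd.DataFrame({
    'start': pd.to_datetime(['2016-05-14', '2016-05-14', '2016-05-17']),
    'end': pd.to_datetime(['2016-05-16', '2016-05-16', '2016-05-18']),
    'event': ['E1', 'E2', 'E3'],
})

# Compare every timestamp (as a column) against every interval (as a row).
ts = df_1['timestamp'].to_numpy()[:, None]
mask = (ts >= df_2['start'].to_numpy()) & (ts <= df_2['end'].to_numpy())

# Positions of the matching (df_1 row, df_2 row) pairs.
i, j = np.nonzero(mask)
matched = pd.DataFrame({'row': i, 'event': df_2['event'].to_numpy()[j]})

# Left-merge on the row position so unmatched rows keep NaN.
out = (df_1.assign(row=np.arange(len(df_1)))
           .merge(matched, on='row', how='left')
           .drop(columns='row'))
print(out)
```

Compared with a cross join, only the boolean mask is materialized in full, not a merged frame carrying every column.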
      

      【Comments】:
