【Question Title】: Check if one dataframe exists in another
【Posted】: 2020-05-13 19:06:55
【Question Description】:

I have two dataframes, Overall and df2.

Overall

Time                ID_1    ID_2               
2020-02-25 09:24:14 140209  81625000
2020-02-25 09:24:14 140216  91625000
2020-02-25 09:24:18 140219  80250000
2020-02-25 09:24:18 140221  90250000
25/02/2020 09:42:02     143982  39075000

df2

ID_1    ID_2            Time                  Match?
140209  81625000    25/02/2020 09:24:14    no_match
143983  44075000    25/02/2020 09:42:02    no_match
143982  39075000    25/02/2020 09:42:02    match
143984  39075000    25/02/2020 09:42:02    no_match

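I want to check whether df2 exists in Overall and, if it does, whether df2.Match? on the same row says match. If it does, return a new column that says yes; if it says no_match, return no.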

I tried

Overall_1 = pds.merge(Overall, df2, on=….., how='left', indicator= 'Exist')
Overall_1.drop([...], inplace = True, axis =1 )
Overall_1['Exist']= np.where((Overall_1.Exist =='both') & (Overall_1.Match? == match), 'yes', 'no')

but got this error

TypeError: Cannot perform 'rand_' with a dtyped [bool] array and scalar of type [float]

So the resulting Overall_1 dataframe should look like this:

Time                ID_1    ID_2             Exist   
2020-02-25 09:24:14 140209  81625000     No
2020-02-25 09:24:14 140216  91625000     NaN
2020-02-25 09:24:18 140219  80250000     NaN
2020-02-25 09:24:18 140221  90250000     NaN
25/02/2020 09:42:02     143982  39075000     Yes

【Question Comments】:

    Tags: python pandas dataframe


    【Solution 1】:

    Use merge and np.select.

    import numpy as np
    import pandas as pd

    # df1 is the question's Overall dataframe (df1 = Overall)
    df3 = pd.merge(df1,df2,on=['ID_1','ID_2','Time'],how='left',indicator='Exists')
    
    
    col1 = df3['Match?']
    col2 = df3['Exists']
    
    conditions = [(col1.eq('match') & (col2.eq('both'))),
                  (col1.eq('no_match') & (col2.eq('both')))
                 ]
    
    choices = ['yes','no']
    
    df3['Exists'] = np.select(conditions,choices,default=np.nan)
    

    print(df3.drop('Match?',axis=1))
    
    
                     Time    ID_1      ID_2 Exists
    0 2020-02-25 09:24:14  140209  81625000     no
    1 2020-02-25 09:24:14  140216  91625000    nan
    2 2020-02-25 09:24:18  140219  80250000    nan
    3 2020-02-25 09:24:18  140221  90250000    nan
    4 2020-02-25 09:42:02  143982  39075000    yes
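
The merge only finds matches if the three key columns have the same dtype in both frames, and the `nan` values printed above are strings rather than real missing values. Depending on the NumPy version, mixing string choices with a float default in `np.select` may also raise a TypeError. A minimal sketch of the same approach follows, assuming the sample data from the question, that both `Time` columns can be parsed with the formats shown, and illustrative names (`overall`, `merged`, `labels`) that are not from the original post:

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question; Time starts out as text (assumption).
overall = pd.DataFrame({
    "Time": ["2020-02-25 09:24:14", "2020-02-25 09:24:14",
             "2020-02-25 09:24:18", "2020-02-25 09:24:18",
             "2020-02-25 09:42:02"],
    "ID_1": [140209, 140216, 140219, 140221, 143982],
    "ID_2": [81625000, 91625000, 80250000, 90250000, 39075000],
})
df2 = pd.DataFrame({
    "ID_1": [140209, 143983, 143982, 143984],
    "ID_2": [81625000, 44075000, 39075000, 39075000],
    "Time": ["25/02/2020 09:24:14", "25/02/2020 09:42:02",
             "25/02/2020 09:42:02", "25/02/2020 09:42:02"],
    "Match?": ["no_match", "no_match", "match", "no_match"],
})

# Parse both Time columns so the merge keys compare equal.
overall["Time"] = pd.to_datetime(overall["Time"], format="%Y-%m-%d %H:%M:%S")
df2["Time"] = pd.to_datetime(df2["Time"], format="%d/%m/%Y %H:%M:%S")

merged = overall.merge(df2, on=["ID_1", "ID_2", "Time"],
                       how="left", indicator="Exists")

in_both = merged["Exists"].eq("both")
conditions = [in_both & merged["Match?"].eq("match"),
              in_both & merged["Match?"].eq("no_match")]

# Keep choices and default all strings, then blank out the placeholder
# so unmatched rows end up as real missing values.
labels = pd.Series(np.select(conditions, ["yes", "no"], default=""),
                   index=merged.index)
merged["Exists"] = labels.mask(labels.eq(""))

result = merged.drop(columns="Match?")
print(result)
```

Using an empty-string placeholder and masking it afterwards is one way to get true NaN values; `Exists` can then be tested with `isna()`.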
    

    Or, more simply, use .merge with a replace dict

    df3 = pd.merge(df1,df2,on=['ID_1','ID_2','Time'],how='left')\
                                          .replace({'no_match' : 'no', 
                                                    'match' : 'yes'})\
                                          .rename(columns={'Match?' : 'Exists'})
    
    print(df3)
    
                     Time    ID_1      ID_2 Exists
    0 2020-02-25 09:24:14  140209  81625000     no
    1 2020-02-25 09:24:14  140216  91625000    NaN
    2 2020-02-25 09:24:18  140219  80250000    NaN
    3 2020-02-25 09:24:18  140221  90250000    NaN
    4 2020-02-25 09:42:02  143982  39075000    yes
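
`DataFrame.replace` is applied to every column of the merged frame, so any other column that happened to contain the text `match` or `no_match` would be rewritten too. A possible variation, sketched here with two rows from the question and a `Time` column assumed to be already parsed, maps only the `Match?` column (the names `overall`, `merged`, and `result` are illustrative):

```python
import pandas as pd

# Two rows from each frame in the question; Time already shares one dtype (assumption).
overall = pd.DataFrame({
    "Time": pd.to_datetime(["2020-02-25 09:24:14", "2020-02-25 09:42:02"]),
    "ID_1": [140216, 143982],
    "ID_2": [91625000, 39075000],
})
df2 = pd.DataFrame({
    "ID_1": [143982, 143984],
    "ID_2": [39075000, 39075000],
    "Time": pd.to_datetime(["2020-02-25 09:42:02", "2020-02-25 09:42:02"]),
    "Match?": ["match", "no_match"],
})

merged = overall.merge(df2, on=["ID_1", "ID_2", "Time"], how="left")

# map() touches only this column; rows without a partner in df2 stay NaN.
merged["Exists"] = merged["Match?"].map({"match": "yes", "no_match": "no"})
result = merged.drop(columns="Match?")
print(result)
```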
    

    【Comments】:

      【Solution 2】:

      You can try: df_diff = pd.concat([Overall,df2]).drop_duplicates(keep=False)
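
For context, `drop_duplicates(keep=False)` removes every row that occurs more than once, so this pattern returns the rows the two frames do not share rather than a yes/no flag per row. A small sketch, using a reduced version of the question's data without the `Time` column (an assumption made to keep it short), shows what it yields:

```python
import pandas as pd

overall = pd.DataFrame({"ID_1": [140209, 143982],
                        "ID_2": [81625000, 39075000]})
df2 = pd.DataFrame({"ID_1": [143982, 143984],
                    "ID_2": [39075000, 39075000],
                    "Match?": ["match", "no_match"]})

# Rows are compared on all columns. Rows from `overall` get NaN in 'Match?',
# so no row counts as a duplicate and nothing is dropped.
all_cols = pd.concat([overall, df2]).drop_duplicates(keep=False)

# Comparing on the key columns only drops every key that appears in both frames,
# leaving the rows that exist in just one of them.
keys_only = pd.concat([overall, df2]).drop_duplicates(
    subset=["ID_1", "ID_2"], keep=False)

print(len(all_cols), keys_only["ID_1"].tolist())
```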

      【Comments】:

      • Hi, your code doesn't do what I want!