【问题标题】:How to compare two large dataframes to check new entries?如何比较两个大型数据框以检查新条目?
【发布时间】:2021-07-21 03:52:56
【问题描述】:

我有两个熊猫数据框(2020 年上半年和 2020 年下半年)以及他们购买的产品。有没有一种有效的方法来找出哪些客户在半 2 中买了东西,而他们在半 1 中没有购买?

h1_2020:

Customer Product
1 A
1 B
2 A
3 A

h2_2020:

Customer Product
1 A
2 B
3 C
3 A

所以根据上面两张表的输出应该是:

Customer Bought something new in half 2?
1 0
2 1
3 1

【问题讨论】:

  • 两个数据框的客户集是否相同?

标签: python pandas dataframe numpy


【解决方案1】:

使用pd.merge。如果您在 h2_2020 中有新客户,他们将出现在最终输出中:

>>> pd.merge(h1_2020, h2_2020, on=['Customer', 'Product'], indicator=True, how='right') \
      .query("_merge == 'right_only'") \
      .value_counts('Customer') \
      .reindex(set(h1_2020['Customer'].unique()).union(h2_2020['Customer'].unique()), fill_value=0)

Customer
1    0
2    1
3    1
dtype: int64

【讨论】:

    【解决方案2】:

    您可以使用设置的差异来比较值

       # dfh1: Dataframe for h1 and dfh2 is the DataFrame for h2
       product_by_customer = dict()
       customers = df_h1['Customer'].unique()
       for customer in customers:
           products_in_h1 = set(dfh1[dfh1['customer']==customer]['product'].values)
           products_in_h2 = set(dfh2[dfh2['customer']==customer]['product'].values)
           product_by_customer[customer] = list(products_in_h2.difference(products_in_h1))
       print(product_by_customer)
    
    

    【讨论】:

      【解决方案3】:

      加入,然后只检查来自h2_2020 的记录:

      h1h2 = h1_2020.merge(h2_2020, how='outer', indicator=True)
      
      #   Customer Product      _merge
      #0         1       A        both
      #1         1       B   left_only
      #2         2       A   left_only
      #3         3       A        both
      #4         2       B  right_only
      #5         3       C  right_only
      
      (h1h2.groupby('Customer')['_merge']
           .agg(lambda g: g.eq('right_only').sum())
           .rename('Newly Bought')
           .reset_index())
      
      #   Customer  Newly Bought
      #0         1             0
      #1         2             1
      #2         3             1
      

      【讨论】:

        【解决方案4】:

        如果两个数据集的客户集相同,那么您可以执行以下操作:

        h2_2020 = h2_2020.drop_duplicates()
        h1_2020 = h1_2020.drop_duplicates()
        
        h2_2020.groupby('Customer').apply(lambda x: len(x)) > h1_2020.groupby('Customer').apply(lambda x: len(x))
        

        【讨论】:

          【解决方案5】:

          我更改了数据框 h2_2020 添加客户 4 以防新客户进入

          # module
          
          import pandas as pd
          import numpy as np
          
          # make a dataset
          
          h1_2020={'customer':[1,1,2,3],
                   'product':'A B A A'.split()
                   }
          
          h2_2020={'customer':[1,2,3,3,4],
                   'product':'A B C A A'.split()
                   }
          
          h1_2020=pd.DataFrame(h1_2020)
          print(h1_2020)
          h2_2020=pd.DataFrame(h2_2020)
          print(h2_2020)
          

          所以我的结果有其他列指示客户是否是一半 2 中的新客户。

          # drop_duplicates
          
          dh1_2020=h1_2020.drop_duplicates()
          dh2_2020=h2_2020.drop_duplicates()
          
          # append and drop_duplicates to see what products customers bought during year
          
          a=dh2_2020.append(dh1_2020).drop_duplicates().sort_values(['customer','product'])
          print(a)
          
          # check wheter custumers bought new product in half 2 
          
          check=a.merge(dh1_2020, on=['customer','product'],indicator=True,how='outer')
          print(check) 
          # => 'left_only' means that the customer bought new product in half 2, but customer 4 is new customer in half 2
          
          '''
          #make a dummy column indicating a customer bought new product or not 
          and the other dummy column indicating if a customer is new customer in half 2 or not
          
          '''
          m1=pd.DataFrame(dh1_2020.customer.drop_duplicates())
          print(m1)
          m2=check.loc[check['_merge']=='left_only',['customer','_merge']].drop_duplicates()
          print(m2)
          
          whatwewant=m1.merge(m2,on='customer',how='outer', indicator='new_customer_hf2')
          whatwewant['_merge']=1*(whatwewant['_merge']=='left_only')
          whatwewant['new_customer_hf2']=1*(whatwewant['new_customer_hf2']=='right_only')
          print(whatwewant)
          
          #   customer  _merge  new_customer_hf2
          #0         1       0                 0
          #1         2       1                 0
          #2         3       1                 0
          #3         4       1                 1
          

          【讨论】:

            猜你喜欢
            • 1970-01-01
            • 2020-02-12
            • 2016-02-25
            • 1970-01-01
            • 1970-01-01
            • 1970-01-01
            • 1970-01-01
            • 2018-07-27
            • 1970-01-01
            相关资源
            最近更新 更多