【Question title】: Python/Pandas: How to consolidate repeated rows with NaN in different columns?
【Posted】: 2017-03-12 06:15:45
【Question description】:

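There must be a better way to do this, please help me.

Here is an excerpt of some data I have to clean up. It contains several kinds of "repeated" rows (not all rows are repeated):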

df =

LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
   100 | ABC        | Paid       |         NaN |        34200 |
   100 | ABC        | Paid       |         724 |        34200 |
   200 | DEF        | Write Off  |         611 |         9800 |
   200 | DEF        | Write Off  |         611 |          NaN |
   300 | GHI        | Paid       |         NaN |       247112 |
   300 | GHI        | Paid       |         799 |          NaN |
   400 | JKL        | Paid       |         NaN |          NaN |
   500 | MNO        | Paid       |         444 |          NaN |

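So I have the following kinds of repeated cases:

  1. NaN and a valid value in the CreditScore column (LoanID = 100)
  2. NaN and a valid value in the AnnualIncome column (LoanID = 200)
  3. NaN and a valid value in the CreditScore column, and NaN and a valid value in the AnnualIncome column (LoanID = 300)
  4. LoanID 400 and 500 are "normal" cases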

So, obviously, what I want is a dataframe without duplicates, like this:

LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
   100 | ABC        | Paid       |         724 |        34200 |
   200 | DEF        | Write Off  |         611 |         9800 |
   300 | GHI        | Paid       |         799 |       247112 |
   400 | JKL        | Paid       |         NaN |          NaN |
   500 | MNO        | Paid       |         444 |          NaN |

Here is how I solved it:

# Get the repeated keys:
rep = df['LoanID'].value_counts()
rep = rep[rep > 1]  # keys that appear more than once (the sample duplicates appear twice)

# Now we get the valid number (we overwrite the NaNs)
for i in rep.keys():
    df.loc[df['LoanID'] == i, 'CreditScore']  = df[df['LoanID'] == i]['CreditScore'].max()
    df.loc[df['LoanID'] == i, 'AnnualIncome'] = df[df['LoanID'] == i]['AnnualIncome'].max()

# Drop duplicates   
df.drop_duplicates(inplace=True)

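This works and gives exactly what I need. The problem is that this dataframe has several hundred thousand records, so this method takes "forever". There must be a better way to do this, right?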

【Question discussion】:

    Tags: python pandas


    【Solution 1】:

    Grouping by LoanID, filling missing values forward and backward within each group, and then dropping duplicates seems to work:

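    # Within each LoanID group: forward-fill, then back-fill, then drop the
    # rows that have become identical. x.ffill()/x.bfill() stand in for
    # fillna(method=...), which recent pandas versions deprecate.
    # Assumes apply() still passes the LoanID column to the lambda.
    (df.groupby('LoanID')
       .apply(lambda x: x.ffill().bfill().drop_duplicates())
       .reset_index(drop=True)
       .set_index('LoanID'))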
    #       CustomerID LoanStatus  CreditScore  AnnualIncome  
    #LoanID                                                             
    #100           ABC       Paid        724.0       34200.0       
    #200           DEF  Write Off        611.0        9800.0       
    #300           GHI       Paid        799.0      247112.0       
    #400           JKL       Paid          NaN           NaN       
    #500           MNO       Paid        444.0           NaN       
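
A possibly simpler variant of the same idea, offered only as a sketch and not part of the original answer: `GroupBy.first()` returns the first non-null value of each column within each group, which yields one row per LoanID without an explicit fill step. The sample frame below is rebuilt from the table in the question, and the name `consolidated` is just an illustrative choice.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "LoanID": [100, 100, 200, 200, 300, 300, 400, 500],
    "CustomerID": ["ABC", "ABC", "DEF", "DEF", "GHI", "GHI", "JKL", "MNO"],
    "LoanStatus": ["Paid", "Paid", "Write Off", "Write Off",
                   "Paid", "Paid", "Paid", "Paid"],
    "CreditScore": [np.nan, 724, 611, 611, np.nan, 799, np.nan, 444],
    "AnnualIncome": [34200, 34200, 9800, np.nan, 247112, np.nan, np.nan, np.nan],
})

# first() skips NaN by default, so each column keeps its first valid value
# per LoanID; a column that is NaN for the whole group stays NaN.
consolidated = df.groupby("LoanID").first()
print(consolidated)
```

One difference to keep in mind: if a group holds two different non-null values in the same column, `first()` silently keeps the earlier one, whereas the fill-and-drop approach above would leave both rows in place. Which behaviour is preferable depends on the data.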
    

    【Discussion】:
