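【Posted】:2017-03-12 06:15:45
【Question】:
There must be a better way to do this, so any help is appreciated.
Here is an excerpt of some data I have to clean. It contains several kinds of "duplicate" rows (not every row is duplicated):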
df =
LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
100 | ABC | Paid | NaN | 34200 |
100 | ABC | Paid | 724 | 34200 |
200 | DEF | Write Off | 611 | 9800 |
200 | DEF | Write Off | 611 | NaN |
300 | GHI | Paid | NaN | 247112 |
300 | GHI | Paid | 799 | NaN |
400 | JKL | Paid | NaN | NaN |
500 | MNO | Paid | 444 | NaN |
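For reproducibility, the excerpt above can be rebuilt roughly as follows. This is only a sketch: the columns hidden behind `...` are omitted, and the variable name `repeated` is an illustrative choice, not something from the real data set.

```python
import numpy as np
import pandas as pd

# Sample frame mirroring the excerpt (the trailing "..." columns are left out).
df = pd.DataFrame({
    "LoanID": [100, 100, 200, 200, 300, 300, 400, 500],
    "CustomerID": ["ABC", "ABC", "DEF", "DEF", "GHI", "GHI", "JKL", "MNO"],
    "LoanStatus": ["Paid", "Paid", "Write Off", "Write Off",
                   "Paid", "Paid", "Paid", "Paid"],
    "CreditScore": [np.nan, 724, 611, 611, np.nan, 799, np.nan, 444],
    "AnnualIncome": [34200, 34200, 9800, np.nan, 247112, np.nan, np.nan, np.nan],
})

# Every row whose LoanID occurs more than once (keep=False marks all copies).
repeated = df[df.duplicated("LoanID", keep=False)]
print(repeated)
```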
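So I have the following kinds of duplicates:
- NaN and a valid value in the CreditScore column (LoanID = 100)
- NaN and a valid value in the AnnualIncome column (LoanID = 200)
- NaN and a valid value in the CreditScore column, and NaN and a valid value in the AnnualIncome column (LoanID = 300)
- LoanID 400 and 500 are "normal" cases
What I want, obviously, is a dataframe without the duplicates, like this: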
LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
100 | ABC | Paid | 724 | 34200 |
200 | DEF | Write Off | 611 | 9800 |
300 | GHI | Paid | 799 | 247112 |
400 | JKL | Paid | NaN | NaN |
500 | MNO | Paid | 444 | NaN |
Here is how I solved it:
# Get the repeated keys (LoanIDs that occur more than once):
rep = df['LoanID'].value_counts()
rep = rep[rep > 1]
# Now we get the valid number (we overwrite the NaNs)
for i in rep.keys():
    mask = df['LoanID'] == i
    df.loc[mask, 'CreditScore'] = df.loc[mask, 'CreditScore'].max()
    df.loc[mask, 'AnnualIncome'] = df.loc[mask, 'AnnualIncome'].max()
# Drop duplicates
df.drop_duplicates(inplace=True)
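This works and does exactly what I need. The problem is that the dataframe has several hundred thousand records, so this approach takes "forever". There must be a way to do this better, right?
【Discussion】: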
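One direction that might avoid the Python-level loop is a grouped `transform`, which computes the per-LoanID maximum for every row in a single pass. The sketch below assumes, as the loop does, that the maximum is the value to keep (`max` skips NaN, and a group that is entirely NaN stays NaN). The names `value_cols` and `result` are illustrative, and the sample frame omits the `...` columns.

```python
import numpy as np
import pandas as pd

# Sample frame mirroring the excerpt (the trailing "..." columns are left out).
df = pd.DataFrame({
    "LoanID": [100, 100, 200, 200, 300, 300, 400, 500],
    "CustomerID": ["ABC", "ABC", "DEF", "DEF", "GHI", "GHI", "JKL", "MNO"],
    "LoanStatus": ["Paid", "Paid", "Write Off", "Write Off",
                   "Paid", "Paid", "Paid", "Paid"],
    "CreditScore": [np.nan, 724, 611, 611, np.nan, 799, np.nan, 444],
    "AnnualIncome": [34200, 34200, 9800, np.nan, 247112, np.nan, np.nan, np.nan],
})

value_cols = ["CreditScore", "AnnualIncome"]  # columns that may hold NaN gaps

result = df.copy()
# Broadcast each group's maximum back onto all of its rows (NaN is skipped).
result[value_cols] = result.groupby("LoanID")[value_cols].transform("max")
# Rows of the same loan are now identical, so the extra copies can be dropped.
result = result.drop_duplicates().reset_index(drop=True)
print(result)
```

Because the grouping runs once over the whole frame instead of filtering the frame once per repeated key, this would be expected to scale much better than the loop, although that has not been measured on the real data.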