Pandas: How to fill in null values from columns in another dataframe? (Answers)

【Question title】: Pandas: How to fill in null values from columns in another dataframe?
【Posted】: 2016-10-27 07:21:55
【Question】:

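I have a dataframe in which some essential columns are NULL (I need them for further machine learning work). I have another dataframe containing similar data, and I would like to pull the missing values from it.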

For example, df1 is the main dataframe:

id     col1    col2     col3     col4    col5
1      A       AA       100      5.0     0.9
2      A       BB       150      4.2     0.5
3      A       CC       100      NaN     NaN
4      B       AA       300      NaN     NaN
5      B       BB       100      NaN     NaN
6      C       BB       50       3.4     0.6

The dataframe (df2) from which I want to fill those NaN values in col4 and col5 might look like this:

id     col1    col3     col4    col5
100      A     100      4.5     1.0
101      A     100      3.5     0.8
103      B     300      5.0     0.5
105      B     300      5.5     0.8
106      B     100      5.3     0.2
107      C     100      3.0     1.2

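So the second df has no col2, and it can contain duplicate rows for the same combination of col1 and col3. When that happens, I have to pick the row with the largest col4 value and use it to fill in the corresponding values in df1.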

For example, the correct values in df1 after filling would be:

id     col1    col2     col3     col4    col5
1      A       AA       100      5.0     0.9
2      A       BB       150      4.2     0.5
3      A       CC       100      4.5     1.0
4      B       AA       300      5.5     0.8
5      B       BB       100      5.3     0.2
6      C       BB       50       3.4     0.6

How can I do this?

【Question comments】:

Does the max in col5 always occur in the same row as the max in col4?
@unutbu Not necessarily.

Tags: python pandas join

【Solution 1】:

IIUC (if I understand correctly):

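# Updated for current pandas: .ix, positional axis arguments and
# reindex_axis have been removed, so .loc, drop(columns=...) and
# reindex(columns=...) are used instead.
df1.combine_first(
    df1.merge(
        # Keep, for each (col1, col3) pair, the df2 row with the largest col4.
        df2.drop(columns='id').loc[df2.groupby(['col1', 'col3'])['col4'].idxmax()],
        on=['col1', 'col3'], how='left', suffixes=('_', '')
    )[['col4', 'col5']]
).reindex(columns=df1.columns)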
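
The one-liner above packs three steps into a single expression. The following is a minimal step-by-step sketch of the same idea, not the answerer's original code. It assumes the sample data from the question and a default integer index on df1; the names best_idx, best_rows, candidates and filled are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample frames reconstructed from the tables in the question.
df1 = pd.DataFrame({
    'id':   [1, 2, 3, 4, 5, 6],
    'col1': ['A', 'A', 'A', 'B', 'B', 'C'],
    'col2': ['AA', 'BB', 'CC', 'AA', 'BB', 'BB'],
    'col3': [100, 150, 100, 300, 100, 50],
    'col4': [5.0, 4.2, np.nan, np.nan, np.nan, 3.4],
    'col5': [0.9, 0.5, np.nan, np.nan, np.nan, 0.6],
})
df2 = pd.DataFrame({
    'id':   [100, 101, 103, 105, 106, 107],
    'col1': ['A', 'A', 'B', 'B', 'B', 'C'],
    'col3': [100, 100, 300, 300, 100, 100],
    'col4': [4.5, 3.5, 5.0, 5.5, 5.3, 3.0],
    'col5': [1.0, 0.8, 0.5, 0.8, 0.2, 1.2],
})

# 1. For each (col1, col3) pair, keep the whole df2 row whose col4 is largest.
best_idx = df2.groupby(['col1', 'col3'])['col4'].idxmax()
best_rows = df2.loc[best_idx, ['col1', 'col3', 'col4', 'col5']]

# 2. Left-merge onto df1's keys so every df1 row gets its candidate values
#    (NaN where df2 has no matching pair).
candidates = df1[['col1', 'col3']].merge(best_rows, on=['col1', 'col3'], how='left')
candidates.index = df1.index  # a merge on columns returns a fresh RangeIndex

# 3. Fill only the missing cells; values already present in df1 are kept.
filled = df1.copy()
filled[['col4', 'col5']] = df1[['col4', 'col5']].fillna(candidates[['col4', 'col5']])

print(filled)
```

Because idxmax selects a whole row, col5 is taken from the same df2 row as the largest col4, which is what the question asks for.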


【Solution 2】:

import numpy as np
import pandas as pd
nan = np.nan

df1 = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'B', 'B', 'C'],
    'col2': ['AA', 'BB', 'CC', 'AA', 'BB', 'BB'],
    'col3': [100, 150, 100, 300, 100, 50],
    'col4': [5.0, 4.2, nan, nan, nan, 3.4],
    'col5': [0.9, 0.5, nan, nan, nan, 0.6],
    'id': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'B', 'B', 'C'],
    'col3': [100, 100, 300, 300, 100, 100],
    'col4': [4.5, 3.5, 5.0, 5.5, 5.3, 3.0],
    'col5': [1.0, 99, 0.5, 0.8, 0.2, 1.2],
    'id': [100, 101, 103, 105, 106, 107]})

df2_max = df2.drop('id', axis=1).groupby(['col1','col3']).max()
df3 = pd.merge(df1[['col1','col3']], df2_max, 
               left_on=['col1','col3'], right_index=True, how='left')
result = df1.combine_first(df3)

yields

  col1 col2  col3  col4  col5  id
0    A   AA   100   5.0   0.9   1
1    A   BB   150   4.2   0.5   2
2    A   CC   100   4.5  99.0   3
3    B   AA   300   5.5   0.8   4
4    B   BB   100   5.3   0.2   5
5    C   BB    50   3.4   0.6   6

First, find the max of df2's col4 and col5 columns for each combination of col1 and col3:

df2_max = df2.drop('id', axis=1).groupby(['col1','col3']).max()
#            col4  col5
# col1 col3            
# A    100    4.5  99.0
# B    100    5.3   0.2
#      300    5.5   0.8
# C    100    3.0   1.2

Note the 99 in the first row (instead of 0.8). I changed the example slightly to show that the col4 max need not be in the same row as the col5 max.
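
To make that difference concrete, here is a small sketch (not part of the original answer) that compares the column-wise groupby max with selecting the single row that holds the largest col4. It reuses the modified df2 from this answer; per_column_max and row_of_max_col4 are illustrative names.

```python
import pandas as pd

# df2 as modified in this answer (col5 of id 101 set to 99).
df2 = pd.DataFrame({
    'id':   [100, 101, 103, 105, 106, 107],
    'col1': ['A', 'A', 'B', 'B', 'B', 'C'],
    'col3': [100, 100, 300, 300, 100, 100],
    'col4': [4.5, 3.5, 5.0, 5.5, 5.3, 3.0],
    'col5': [1.0, 99, 0.5, 0.8, 0.2, 1.2],
})

# Column-wise maxima: col4 and col5 may come from different rows.
per_column_max = df2.drop(columns='id').groupby(['col1', 'col3']).max()

# Row-wise selection: take col5 from the row that holds the largest col4.
row_of_max_col4 = (
    df2.loc[df2.groupby(['col1', 'col3'])['col4'].idxmax()]
       .drop(columns='id')
       .set_index(['col1', 'col3'])
)

print(per_column_max.loc[('A', 100), 'col5'])   # 99.0, taken from id 101
print(row_of_max_col4.loc[('A', 100), 'col5'])  # 1.0, taken from id 100
```

Which of the two is wanted depends on the requirement: the question asks for the row with the largest col4, which corresponds to the idxmax variant.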

Next, merge df1 and df2_max on df1's col1 and col3 columns and df2_max's index:

df3 = pd.merge(df1[['col1','col3']], df2_max, 
               left_on=['col1','col3'], right_index=True, how='left')
#   col1  col3  col4  col5
# 0    A   100   4.5  99.0
# 1    A   150   NaN   NaN
# 2    A   100   4.5  99.0
# 3    B   300   5.5   0.8
# 4    B   100   5.3   0.2
# 5    C    50   NaN   NaN
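
Rows 1 and 5 come back as NaN because df2 has no (A, 150) or (C, 50) pair. As a hedged aside that is not part of the original answer, merge's indicator=True option can list such unmatched rows explicitly. The sketch below assumes a flat copy of the grouped maxima (as produced by reset_index); df1_keys, df2_max_flat, checked and unmatched are illustrative names.

```python
import pandas as pd

# Key columns of df1 from the example above.
df1_keys = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'B', 'B', 'C'],
    'col3': [100, 150, 100, 300, 100, 50],
})

# Flat form of the grouped maxima shown earlier (values copied from the output).
df2_max_flat = pd.DataFrame({
    'col1': ['A', 'B', 'B', 'C'],
    'col3': [100, 100, 300, 100],
    'col4': [4.5, 5.3, 5.5, 3.0],
    'col5': [99.0, 0.2, 0.8, 1.2],
})

# indicator=True adds a '_merge' column: 'both' for matched rows,
# 'left_only' for df1 rows that found no partner in df2.
checked = df1_keys.merge(df2_max_flat, on=['col1', 'col3'],
                         how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', ['col1', 'col3']]

print(unmatched)
```

In this example both unmatched rows already have values in df1, so nothing is lost; on real data the same check shows which NaNs will remain unfilled.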

This gives us a DataFrame holding the max values of col4 and col5, with an index that matches df1's index. That lets us use df1.combine_first to fill the NaNs with values from df3:

result = df1.combine_first(df3)
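
As a toy illustration of why this works (a sketch on made-up values, not the answer's data), combine_first aligns the two frames on index and column labels, keeps every non-null value of the calling frame, and takes a value from the other frame only where the caller has NaN:

```python
import numpy as np
import pandas as pd

# 'left' plays the role of df1, 'right' the role of df3 (hypothetical values).
left = pd.DataFrame({'col4': [5.0, np.nan, np.nan],
                     'col5': [0.9, np.nan, 0.6]})
right = pd.DataFrame({'col4': [4.5, 5.5, 9.9],
                      'col5': [99.0, 0.8, 9.9]})

patched = left.combine_first(right)

print(patched)
#    col4  col5
# 0   5.0   0.9
# 1   5.5   0.8
# 2   9.9   0.6
```

Note that combine_first returns the union of both frames' rows and columns, so it helps if the second frame carries only labels that already exist in the first, as df3 does here.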
