pandas DataFrame concat / update ("upsert")? Answers

【Question title】: pandas DataFrame concat / update ("upsert")?
【Posted】: 2015-10-07 20:11:24
【Question description】:

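I'm looking for an elegant way to append all the rows from one DataFrame to another (both DataFrames have the same index and column structure), but where the same index value appears in both DataFrames, the row from the second DataFrame should be used.

So, for example, if I start with: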

df1:
                    A      B
    date
    '2015-10-01'  'A1'   'B1'
    '2015-10-02'  'A2'   'B2'
    '2015-10-03'  'A3'   'B3'

df2:
    date            A      B
    '2015-10-02'  'a1'   'b1'
    '2015-10-03'  'a2'   'b2'
    '2015-10-04'  'a3'   'b3'

I'd like the result to be:

                    A      B
    date
    '2015-10-01'  'A1'   'B1'
    '2015-10-02'  'a1'   'b1'
    '2015-10-03'  'a2'   'b2'
    '2015-10-04'  'a3'   'b3'

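This is analogous to what I believe some SQL systems call an "upsert", a combination of update and insert: each row of df2 is either (a) used to update an existing row in df1 if the row key already exists in df1, or (b) inserted at the end of df1 if the row key does not exist.

I came up with the following: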

import pandas as pd

result = (
    pd.concat([df1, df2])   # concat the two DataFrames
    .reset_index()          # turn 'date' into a regular column
    .groupby('date')        # group rows by values in the 'date' column
    .tail(1)                # take the last row in each group
    .set_index('date')      # restore 'date' as the index
)

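This seems to work, but it relies on the rows in each groupby group always being in the same order as in the original DataFrames, which I haven't checked, and it looks unpleasantly convoluted.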
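The ordering assumption can at least be spot-checked on the sample data. The sketch below rebuilds df1 and df2 from the example (string dates are used as the index for simplicity, which is an assumption) and confirms that the last row per group comes from df2. It is a check on one input, not a proof for every case.

```python
import pandas as pd

# Sample frames from the question; plain string dates are assumed here.
df1 = pd.DataFrame(
    {"A": ["A1", "A2", "A3"], "B": ["B1", "B2", "B3"]},
    index=pd.Index(["2015-10-01", "2015-10-02", "2015-10-03"], name="date"),
)
df2 = pd.DataFrame(
    {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]},
    index=pd.Index(["2015-10-02", "2015-10-03", "2015-10-04"], name="date"),
)

checked = (
    pd.concat([df1, df2])
    .reset_index()
    .groupby("date")
    .tail(1)              # last row of each group, in concatenation order
    .set_index("date")
)

# Every date present in df2 should carry df2's values.
overlap_ok = checked.loc[df2.index].equals(df2)
print(checked)
print("df2 rows win:", overlap_ok)
```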

Does anyone have ideas for a more straightforward solution?

【Question discussion】:

Tags: python pandas

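【Solution 1】:

One solution is to concatenate df1 with the new rows from df2 (i.e. those whose index does not match any index in df1), then update the values with those from df2.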

df = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
df.update(df2)

>>> df
             A   B
2015-10-01  A1  B1
2015-10-02  a1  b1
2015-10-03  a2  b2
2015-10-04  a3  b3

Edit: Following @chrisb's suggestion, this can be simplified further as follows:

pd.concat([df1[~df1.index.isin(df2.index)], df2])

Thanks, Chris!
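For anyone who wants to run the simplified version end to end, here is a minimal sketch on the question's sample data. The helper name `upsert` and the use of plain string dates are assumptions made for illustration; they are not part of pandas.

```python
import pandas as pd

def upsert(old, new):
    """Hypothetical helper: rows of `new` replace or extend rows of `old`."""
    keep = ~old.index.isin(new.index)   # rows of `old` not overridden by `new`
    return pd.concat([old[keep], new])

df1 = pd.DataFrame(
    {"A": ["A1", "A2", "A3"], "B": ["B1", "B2", "B3"]},
    index=pd.Index(["2015-10-01", "2015-10-02", "2015-10-03"], name="date"),
)
df2 = pd.DataFrame(
    {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]},
    index=pd.Index(["2015-10-02", "2015-10-03", "2015-10-04"], name="date"),
)

result = upsert(df1, df2)
print(result)
```

The result keeps the untouched rows of df1 first and then all rows of df2, so the index is only sorted here because the sample data happens to line up. Append `.sort_index()` if a sorted index matters.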

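【Discussion】:

Nice. I'm also wondering about efficiency. This approach is clearly more efficient than my groupby solution, but it looks like it still requires multiple passes over the data in df1 and df2 (I mean, in terms of what pandas has to do internally). If anyone has ideas for a more efficient approach, I'd love to hear them!
You can avoid the update by doing it in the opposite order: pd.concat([df1[~df1.index.isin(df2.index)], df2])
@embeepea Well, YMMV. But this is actually quite efficient: it involves one set operation (on the indexes), one take (indexing), and one copy (the concat). E.g. for 1MM rows it takes 150 ms on my machine.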

【Solution 2】:

As of pandas 1.0.3, the desired UPSERT functionality is provided directly by combine_first:

combined = df2.combine_first(df1)

print(combined)
#               A   B
# 2015-10-01    A1  B1
# 2015-10-02    a1  b1
# 2015-10-03    a2  b2
# 2015-10-04    a3  b3

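To get this UPSERT behavior, the DataFrame whose data takes priority (the updating DataFrame, df2 in this case) must be the one the method is called on.

It essentially: (1) aligns rows and columns, (2) gives priority to non-NaN data, and (3) where a data point is defined in both DataFrames, gives priority to the data in df2, which is basically what you want.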
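A self-contained sketch of the call on the question's sample data follows. String dates are assumed for the index, and neither frame contains NaN, which is the case where the value-wise merge and a row-wise upsert coincide.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"A": ["A1", "A2", "A3"], "B": ["B1", "B2", "B3"]},
    index=pd.Index(["2015-10-01", "2015-10-02", "2015-10-03"], name="date"),
)
df2 = pd.DataFrame(
    {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]},
    index=pd.Index(["2015-10-02", "2015-10-03", "2015-10-04"], name="date"),
)

# df2 is the caller, so its non-null values take priority over df1's.
combined = df2.combine_first(df1)
print(combined)
```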

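【Discussion】:

An UPSERT operation is a row-wise insert or replace. The combine_first operation is value-wise. These are not equivalent operations. With UPSERT, the new row completely replaces the existing row. With combine_first, the non-null values of the new row only replace the null values of the existing row (and all existing non-null values stay in place).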
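To make the row-wise versus value-wise distinction concrete, here is a small sketch using the call order from the answer above (new.combine_first(old)). The frames `old` and `new` are made up for illustration; the only point is that `new` contains a NaN in a row that also exists in `old`.

```python
import numpy as np
import pandas as pd

old = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]}, index=["x", "y"])
new = pd.DataFrame({"A": ["a2", "a3"], "B": [np.nan, "b3"]}, index=["y", "z"])

# Row-wise upsert: row "y" of `new` replaces row "y" of `old` entirely.
row_wise = pd.concat([old[~old.index.isin(new.index)], new])

# Value-wise merge: the NaN in `new` is filled from `old`.
value_wise = new.combine_first(old)

print(row_wise)     # B is NaN for row "y"
print(value_wise)   # B is "B2" for row "y"
```

Which of the two is right depends on whether a NaN in the incoming data means "no value supplied" or "set this to missing".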

【Solution 3】:

In addition to the correct answer, also watch out for columns that do not exist in both DataFrames:

    df1 = pd.DataFrame([['test',1, True], ['test2',2, True]]).set_index(0)
    df2 = pd.DataFrame([['test2',4], ['test3',3]]).set_index(0)

If you use the above solution as is, you get:

    >>>     1   2
    0       
    test    1   True
    test2   4   NaN
    test3   3   NaN

But if you expect the following output:

    >>>     1   2
    0       
    test    1   True
    test2   4   True
    test3   3   NaN

Just change the statement to:

    df1 = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
    df1.update(df2)
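Put together as a runnable sketch (the result is bound to a new name, `merged`, which is an assumption made here so that df1 stays untouched):

```python
import pandas as pd

df1 = pd.DataFrame([["test", 1, True], ["test2", 2, True]]).set_index(0)
df2 = pd.DataFrame([["test2", 4], ["test3", 3]]).set_index(0)

# Append only the rows of df2 that are new, so existing rows keep column 2.
merged = pd.concat([df1, df2[~df2.index.isin(df1.index)]])

# Overwrite matching cells with df2's non-NaN values (modifies merged in place).
merged.update(df2)

print(merged)
```

Because update works cell by cell and skips NaN values in df2, column 2 of the "test2" row survives, while column 1 takes df2's value.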

【Discussion】: