pandas - merge two data frames, overwrite, and specify which columns to keep

【Question title】: pandas - merge two data frames overwrite and specify which columns to keep
【Posted】: 2018-09-11 22:55:54
【Question description】:

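I am trying to merge two pandas data frames, although what I want may not actually be a merge.

Both frames have two matching columns. One holds unique values that can be used for the join. The other is empty in one frame and populated in the other.

I want to overwrite the empty field while matching on the unique field, but keep only the overwritten column; I do not want the remaining columns from the second DataFrame.

Hopefully the following explains it further: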

>>> import pandas as pd
>>> animals = [{"animal" : "dog", "name" : "freddy", "food" : ""},{"animal" : "cat", "name" : "dexter", "food" : ""},{"animal" : "dog", "name" : "lou lou", "food" : ""}]
>>> foods = [{"name" : "freddy", "food" : "dog mix", "brand" : "doggys dog"},{"name" : "dexter", "food" : "fussy cat mix", "brand" : "fish fishy"},{"name" : "lou lou", "food" : "bones", "brand" : "i was a cow"}]
>>> a_pd = pd.DataFrame(animals)
>>> a_pd
  animal food     name
0    dog        freddy
1    cat        dexter
2    dog       lou lou
>>> f_pd = pd.DataFrame(foods)
>>> f_pd
         brand           food     name
0   doggys dog        dog mix   freddy
1   fish fishy  fussy cat mix   dexter
2  i was a cow          bones  lou lou
>>>
>>>
>>> animal_data = a_pd.merge(f_pd, on='name', how='left')
>>> animal_data
  animal food_x     name        brand         food_y
0    dog          freddy   doggys dog        dog mix
1    cat          dexter   fish fishy  fussy cat mix
2    dog         lou lou  i was a cow          bones
>>>

I should end up with only food; I don't want brand (also note this is sample data; the real data has many more columns).

Desired result:

>>> animal_data
  animal        name            food
0    dog      freddy         dog mix
1    cat      dexter   fussy cat mix
2    dog     lou lou           bones

【Comments】:

Tags: python pandas

【Solution 1】:

Use:

animal_data = a_pd.merge(f_pd, on='name', how='left', suffixes=('_x','')).drop('food_x', axis=1)

Output:

  animal     name        brand           food
0    dog   freddy   doggys dog        dog mix
1    cat   dexter   fish fishy  fussy cat mix
2    dog  lou lou  i was a cow          bones

Or:

a_pd[['animal','name']].merge(f_pd, how='left')

Output:

  animal     name        brand           food
0    dog   freddy   doggys dog        dog mix
1    cat   dexter   fish fishy  fussy cat mix
2    dog  lou lou  i was a cow          bones
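
Both outputs above still include brand, which the question wanted to leave out. A minimal sketch of one way to avoid that, assuming the sample data from the question: restrict f_pd to the join key plus the column to bring over before merging. The frames are rebuilt here only so the example runs on its own.

```python
import pandas as pd

animals = [
    {"animal": "dog", "name": "freddy", "food": ""},
    {"animal": "cat", "name": "dexter", "food": ""},
    {"animal": "dog", "name": "lou lou", "food": ""},
]
foods = [
    {"name": "freddy", "food": "dog mix", "brand": "doggys dog"},
    {"name": "dexter", "food": "fussy cat mix", "brand": "fish fishy"},
    {"name": "lou lou", "food": "bones", "brand": "i was a cow"},
]
a_pd = pd.DataFrame(animals)
f_pd = pd.DataFrame(foods)

# Drop the empty 'food' column on the left and take only the key plus the
# wanted column on the right, so 'brand' never enters the merge and no
# '_x' / '_y' suffixes are produced.
animal_data = a_pd.drop(columns="food").merge(
    f_pd[["name", "food"]], on="name", how="left"
)
print(animal_data)
```

With the real data, the list passed to f_pd[...] would name whichever columns should be overwritten.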

【Comments】:

【Solution 2】:

You can use update:

a_pd.set_index('name',inplace=True)
a_pd.update(f_pd.set_index('name'))
a_pd
Out[68]: 
        animal           food
name                         
freddy     dog        dog mix
dexter     cat  fussy cat mix
lou lou    dog          bones
a_pd.reset_index()
Out[69]: 
      name animal           food
0   freddy    dog        dog mix
1   dexter    cat  fussy cat mix
2  lou lou    dog          bones
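
Note that update works in place and returns None, and that a_pd.reset_index() above returns a new frame without changing a_pd. A hedged sketch of the same idea that leaves the original a_pd untouched, assuming the sample data from the question (the variable name result is only illustrative):

```python
import pandas as pd

a_pd = pd.DataFrame({
    "animal": ["dog", "cat", "dog"],
    "name": ["freddy", "dexter", "lou lou"],
    "food": ["", "", ""],
})
f_pd = pd.DataFrame({
    "name": ["freddy", "dexter", "lou lou"],
    "food": ["dog mix", "fussy cat mix", "bones"],
    "brand": ["doggys dog", "fish fishy", "i was a cow"],
})

# set_index returns a new frame, so a_pd itself is not modified.
result = a_pd.set_index("name")
# update aligns on the index and only touches columns that already exist
# in 'result', so 'brand' is ignored. It modifies 'result' in place.
result.update(f_pd.set_index("name"))
# Restore 'name' as a column and put the columns in the desired order.
result = result.reset_index()[["animal", "name", "food"]]
print(result)
```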

Or we can use map (starting from the original a_pd, with name still a regular column):

a_pd.food=a_pd.name.map(f_pd.set_index('name').food)
a_pd
Out[74]: 
  animal           food     name
0    dog        dog mix   freddy
1    cat  fussy cat mix   dexter
2    dog          bones  lou lou
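
A self-contained sketch of the map approach, under the assumption (stated in the question) that name is unique in f_pd. The extra row "rex" is hypothetical and only there to show what happens when a name has no match: map yields NaN, and fillna falls back to the existing value.

```python
import pandas as pd

a_pd = pd.DataFrame({
    "animal": ["dog", "cat", "dog", "dog"],
    "name": ["freddy", "dexter", "lou lou", "rex"],  # "rex" is hypothetical
    "food": ["", "", "", "unknown"],
})
f_pd = pd.DataFrame({
    "name": ["freddy", "dexter", "lou lou"],
    "food": ["dog mix", "fussy cat mix", "bones"],
    "brand": ["doggys dog", "fish fishy", "i was a cow"],
})

# Build a name -> food lookup Series; 'brand' is never involved.
lookup = f_pd.set_index("name")["food"]

# Map each name to its food; names missing from the lookup become NaN,
# so keep the existing value for those rows.
a_pd["food"] = a_pd["name"].map(lookup).fillna(a_pd["food"])
print(a_pd)
```

Only the food column is assigned, so the column order and every other column of a_pd stay as they were.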

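【Comments】:

Thank you very much. I find using map much cleaner than merging, dropping, reordering, and fiddling around in between.

【Solution 3】:

I would either try drop, or just select the columns you want to keep: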

animal_data.drop(['food_x', 'brand'], axis=1, inplace=True)

or

animal_data = animal_data[['animal', 'name', 'food']]
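
Applied to the animal_data from the question, whose columns are food_x and food_y, dropping alone would leave food_y, and selecting 'food' would fail because no such column exists yet. A minimal sketch that adds a rename step, assuming the question's sample data:

```python
import pandas as pd

a_pd = pd.DataFrame({
    "animal": ["dog", "cat", "dog"],
    "name": ["freddy", "dexter", "lou lou"],
    "food": ["", "", ""],
})
f_pd = pd.DataFrame({
    "name": ["freddy", "dexter", "lou lou"],
    "food": ["dog mix", "fussy cat mix", "bones"],
    "brand": ["doggys dog", "fish fishy", "i was a cow"],
})

# Same merge as in the question: 'food' clashes, so pandas adds the
# default suffixes and produces 'food_x' (left) and 'food_y' (right).
animal_data = a_pd.merge(f_pd, on="name", how="left")

# Drop the empty left-hand column and the unwanted 'brand', then give
# the right-hand column its plain name back.
animal_data = (
    animal_data
    .drop(columns=["food_x", "brand"])
    .rename(columns={"food_y": "food"})
)
print(animal_data)
```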

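【Comments】:

【Solution 4】:

It is better to merge views of the data frames that exclude the columns you do not want in the merged data frame. For example: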

a_cols = ['animal', 'name']
f_cols = ['food', 'name']
a_pd[a_cols].merge(f_pd[f_cols], on='name', how='left')

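With very large data frames this may be faster and may save some memory, because only the relevant columns are carried through the merge.

【Comments】:

I think that if animal is present in both data frames, this will produce the animal_x, animal_y problem.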
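
The concern in this comment can be checked before merging. The sketch below is only an illustration: it assumes a hypothetical right-hand frame that also carries an animal column, finds every non-key column the two frames share, and drops those from the right side so no _x / _y suffixes appear.

```python
import pandas as pd

a_pd = pd.DataFrame({
    "animal": ["dog", "cat", "dog"],
    "name": ["freddy", "dexter", "lou lou"],
})
# Hypothetical right-hand frame that also has an 'animal' column.
f_pd = pd.DataFrame({
    "name": ["freddy", "dexter", "lou lou"],
    "animal": ["dog", "cat", "dog"],
    "food": ["dog mix", "fussy cat mix", "bones"],
})

key = "name"
# Columns present in both frames, other than the join key, would be
# suffixed by merge.
overlap = [c for c in f_pd.columns if c in a_pd.columns and c != key]

# Keep the left-hand copy of the shared columns and drop the right-hand one.
merged = a_pd.merge(f_pd.drop(columns=overlap), on=key, how="left")
print(overlap)
print(merged)
```

Which side's copy to keep is a design choice; here the left frame wins, which matches the question's wish to keep a_pd as the base.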