Combine two pandas dataframes adding corresponding values pt2: answers

【Question title】: Combine two pandas dataframes adding corresponding values pt2
【Posted】: 2018-05-28 21:00:26
【Question description】:

I have these CSV files that I would like to combine as follows:

file1.csv
Date,Time,Unique1,Common
blah,blah,55,92

file2.csv
Date,Time,Unique2,Common
blah,blah,12,25

I would like a pandas dataframe...

Date,Time,Unique1,Unique2,Common (order of columns doesn't matter)
blah,blah,55,12,117

...where 117 is 92 + 25.

I found a post with exactly the same title as this one, which included the following code sample:

each_df = (pd.read_csv(f) for f in all_files)
full_df = pd.concat(each_df).groupby(level=0).sum()

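This does what I need, except that it doesn't carry over the Date and Time columns. I guess that's because sum() doesn't know how to handle them.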

Instead I get...

Unique1,Unique2,Common
<values as expected>

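Please help me carry the Date and Time columns through. They should be exactly the same in every file, so I could index the data by the 'Date' and 'Time' columns.

Thanks in advance.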

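【Question comments】:

You put blah,blah here. But are the Date and Time columns the index you want to join on?
Hi Matt, yes, those are the columns I want to index on.

Tags: python pandas join merge concat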

【Solution 1】:

I think you are looking for merge rather than concat. If you read each csv into a dataframe, you can do the following:

new_df = df2.merge(df1, on=['Date','Time'], how='inner')
new_df['Common'] = new_df['Common_x'] + new_df['Common_y']
new_df[['Date', 'Time','Unique1', 'Unique2' ,'Common']]
#output

   Date  Time  Unique1  Unique2  Common
0  blah  blah       55       12     117
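
For anyone who wants to run the merge approach end to end, here is a minimal self-contained sketch. The in-memory `io.StringIO` objects stand in for file1.csv and file2.csv, and the `suffixes=("_1", "_2")` choice is my own assumption (the answer relies on the default `_x`/`_y` suffixes):

```python
import io

import pandas as pd

# Stand-ins for file1.csv and file2.csv from the question.
file1 = io.StringIO("Date,Time,Unique1,Common\nblah,blah,55,92\n")
file2 = io.StringIO("Date,Time,Unique2,Common\nblah,blah,12,25\n")

df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)

# Merge on the shared keys; the clashing 'Common' columns get suffixes.
merged = df1.merge(df2, on=["Date", "Time"], how="inner", suffixes=("_1", "_2"))

# Add the two 'Common' columns and drop the suffixed intermediates.
merged["Common"] = merged.pop("Common_1") + merged.pop("Common_2")

print(merged)
```

Using `pop` removes the suffixed columns as they are read, so no separate column selection is needed afterwards.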

You could also try this:

one_line = df2.merge(df1, on=['Date','Time'], how='inner').\
set_index(['Date', 'Time','Unique1', 'Unique2']).sum(axis=1).reset_index().\
rename(columns = {0:'Common'})

#output

   Date  Time  Unique1  Unique2  Common
0  blah  blah       55       12     117

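【Comments】:

Thanks Matt. I need to loop over 10 CSVs at a time to produce 1 combined CSV. I realize this is just an extension of doing this concept manually, but the best approach would be to select all the CSVs, convert them to DFs, and then run this process on them.
Unfortunately, that goes beyond the question you posted. The title says to combine two dataframes. Please edit your question, though at this point it might be too broad. If you know how to loop, you can take new_df or one_line and repeat the merge in a loop as needed.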
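
The "repeat the merge in a loop" idea from the comment above could look roughly like the sketch below. The third file and its `Unique3` column are hypothetical, added only to show that the loop extends past two frames, and the `_new` suffix is my own naming choice:

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for the CSV files; with real files,
# a list of paths would go here instead.
sources = [
    io.StringIO("Date,Time,Unique1,Common\nblah,blah,55,92\n"),
    io.StringIO("Date,Time,Unique2,Common\nblah,blah,12,25\n"),
    io.StringIO("Date,Time,Unique3,Common\nblah,blah,7,3\n"),
]

keys = ["Date", "Time"]
combined = None
for src in sources:
    df = pd.read_csv(src)
    if combined is None:
        # The first frame seeds the running result.
        combined = df
        continue
    # Merge the next frame in; only its 'Common' column gets a suffix.
    combined = combined.merge(df, on=keys, how="inner", suffixes=("", "_new"))
    # Fold the new 'Common' into the running total and drop the extra column.
    combined["Common"] = combined["Common"] + combined.pop("Common_new")

print(combined)
```

Because the join is an inner join, any Date/Time pair missing from one file drops out of the result; `how="outer"` would keep it, but the addition would then produce NaN for that row.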

【Solution 2】:

For more than two dataframes, this may be a better option:

import pandas as pd
from functools import reduce

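# all_files is assumed to be the list of CSV paths from the question.
# We will be splitting the data into two groups; a generator can only be
# consumed once, so each group gets its own.
all_files1 = (pd.read_csv(f) for f in all_files)
all_files2 = (pd.read_csv(f) for f in all_files)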

# Merge the data frames together dropping the 'Common' column and set an index
# Default is an inner join.
split_drop_common = reduce(lambda df1, df2 : df1.merge(df2, on=['Date','Time']),
                [df.drop(columns='Common') for df in all_files1]).set_index(['Date','Time'])
# set up the second group
stage = pd.concat(all_files2)

# Drop any of the unique columns and sum the 'Common' column
keep_columns = ['Date','Time','Common']
split_only_common = stage[keep_columns].groupby(['Date','Time']).sum()


# Join on indices. Default is an inner join.
# You can specify the join type with kwarg how='join type'
combine = split_drop_common.join(split_only_common)
# Move Date/Time from the index back into ordinary columns for display
combine.reset_index()

# Output

   Date  Time  Unique1  Unique2  Common
0  blah  blah       55       12     117

You can read about how the reduce function works here.
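
As a rough illustration of what reduce does in the code above: it calls the two-argument function on the first two frames, then on that result and the third frame, and so on. The names `merge_pair` and `frames`, and the third frame with `Unique3`, are made up for this sketch:

```python
from functools import reduce

import pandas as pd

# Three small frames sharing the Date/Time keys (made-up sample data).
frames = [
    pd.DataFrame({"Date": ["blah"], "Time": ["blah"], "Unique1": [55]}),
    pd.DataFrame({"Date": ["blah"], "Time": ["blah"], "Unique2": [12]}),
    pd.DataFrame({"Date": ["blah"], "Time": ["blah"], "Unique3": [7]}),
]


def merge_pair(left, right):
    # Same job as the lambda in the answer: 'left' is the result so far,
    # 'right' is the next frame in the list.
    return left.merge(right, on=["Date", "Time"])


# reduce folds the list from left to right...
via_reduce = reduce(merge_pair, frames)

# ...which is equivalent to nesting the calls by hand.
by_hand = merge_pair(merge_pair(frames[0], frames[1]), frames[2])

print(via_reduce.equals(by_hand))
```

So df1 and df2 in the lambda are just its parameter names: on each step reduce passes in the accumulated merge and the next dataframe.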

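【Comments】:

Thanks.. a few questions if you don't mind.. in split_drop_common, where do df1 and df2 come from? Also, the non-common columns don't all start with 'Unique'; they actually have unique names...
Edited the code to allow for unique names, and added a link to an answer on how the python reduce function works.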