【Question title】:How to pd.merge(..., on="column", ...) when processing data in chunks?
【Posted】:2019-06-08 11:12:28
【Question】:

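I want to merge two DataFrames, but because of its size the right DataFrame has to be processed in chunks. From the second iteration onward (that is, when chunk2 is merged into df), merge creates additional columns (see the MWE), whereas I want the values merged into the existing columns.

Note that the (date) integers in column A are not a unique index in df.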

import pandas as pd

df = pd.DataFrame({'A': [20170801, 20170801, 20170802, 20170901],
                   'B': ['B0', 'B1', 'B2', 'B3'],
                   'C': ['C0', 'C1', 'C2', 'C3'],
                   'D': ['D0', 'D1', 'D2', 'D3']},
                  index=[0, 1, 2, 3])

chunk1 = pd.DataFrame({'A': [20170801, 20170802, 4, 4],
                       'E': ['B4', 'B5', 'B6', 'B7'],
                       'F': ['C4', 'C5', 'C6', 'C7'],
                       'G': ['D4', 'D5', 'D6', 'D7']},
                      index=[0, 1, 2, 3])

chunk2 = pd.DataFrame({'A': [20170901, 67, 68, 69],
                       'E': ['B4', 'B5', 'B6', 'B7'],
                       'F': ['C4', 'C5', 'C6', 'C7'],
                       'G': ['D4', 'D5', 'D6', 'D7']},
                      index=[0, 1, 2, 3])

df = df.merge(chunk1, on='A', how='left')
print(df)

          A   B   C   D    E    F    G
0  20170801  B0  C0  D0   B4   C4   D4
1  20170801  B1  C1  D1   B4   C4   D4
2  20170802  B2  C2  D2   B5   C5   D5
3  20170901  B3  C3  D3  NaN  NaN  NaN

df = df.merge(chunk2, on='A', how='left')
print(df)

          A   B   C   D  E_x  F_x  G_x  E_y  F_y  G_y
0  20170801  B0  C0  D0   B4   C4   D4  NaN  NaN  NaN
1  20170801  B1  C1  D1   B4   C4   D4  NaN  NaN  NaN
2  20170802  B2  C2  D2   B5   C5   D5  NaN  NaN  NaN
3  20170901  B3  C3  D3  NaN  NaN  NaN   B4   C4   D4

The output should look like this:

          A   B   C   D    E    F    G
0  20170801  B0  C0  D0   B4   C4   D4
1  20170801  B1  C1  D1   B4   C4   D4
2  20170802  B2  C2  D2   B5   C5   D5
3  20170901  B3  C3  D3   B4   C4   D4

【Comments】:

  • @jezrael, please consider reopening the question; your link does not provide a solution (or I could not find it there).
  • Reopened, no problem :)

Tags: python pandas dataframe merge


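【Solution 1】:

merge lets you change the default suffixes for overlapping columns. Once that is done, you only need to overwrite the NaN values and drop the now-useless columns. It takes a few extra steps, but it is simple.

So: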

df = df.merge(chunk1, on='A', how='left')    # merge chunk1
df = df.merge(chunk2, on='A', how='left', suffixes=('', '_x'))   # for chunk2, suffix only the new columns

print(df)      # check

          A   B   C   D    E    F    G  E_x  F_x  G_x
0  20170801  B0  C0  D0   B4   C4   D4  NaN  NaN  NaN
1  20170801  B1  C1  D1   B4   C4   D4  NaN  NaN  NaN
2  20170802  B2  C2  D2   B5   C5   D5  NaN  NaN  NaN
3  20170901  B3  C3  D3  NaN  NaN  NaN   B4   C4   D4

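# override NaN in columns E, F, G with the values from the suffixed columns
for col in list('EFG'):
    col2 = col + '_x'
    # pd.isna is a reliable NaN test; an identity check against numpy.NaN is not
    df[col] = df.apply(lambda x: x[col2] if pd.isna(x[col]) else x[col],
                       axis=1)


print(df)   # check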

          A   B   C   D   E   F   G  E_x  F_x  G_x
0  20170801  B0  C0  D0  B4  C4  D4  NaN  NaN  NaN
1  20170801  B1  C1  D1  B4  C4  D4  NaN  NaN  NaN
2  20170802  B2  C2  D2  B5  C5  D5  NaN  NaN  NaN
3  20170901  B3  C3  D3  B4  C4  D4   B4   C4   D4

df.drop(columns=['E_x', 'F_x', 'G_x'], inplace = True)    # drop now useless columns

print(df)    # as expected

          A   B   C   D   E   F   G
0  20170801  B0  C0  D0  B4  C4  D4
1  20170801  B1  C1  D1  B4  C4  D4
2  20170802  B2  C2  D2  B5  C5  D5
3  20170901  B3  C3  D3  B4  C4  D4
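
The steps above handle exactly two chunks. As a rough sketch of how they might generalize to any number of chunks, the helper below repeats the same merge, fill, and drop sequence for each chunk. The function name `merge_in_chunks` and the temporary suffix `_new` are made up for illustration and are not part of pandas or of the original answer. It assumes each key value of A appears in at most one chunk; if a key occurs in several chunks, only the first non-missing value is kept, and duplicate keys inside a chunk multiply rows as in any merge.

```python
import pandas as pd


def merge_in_chunks(left, chunks, on, tmp_suffix="_new"):
    """Left-merge every chunk into `left`, filling already existing columns.

    Hypothetical helper: the name and `tmp_suffix` are illustrative only.
    """
    result = left
    for chunk in chunks:
        # Columns already present in the result would otherwise be suffixed.
        overlap = [c for c in chunk.columns if c != on and c in result.columns]
        result = result.merge(chunk, on=on, how="left",
                              suffixes=("", tmp_suffix))
        for col in overlap:
            # Keep existing values; use the chunk's value only where missing.
            result[col] = result[col].fillna(result[col + tmp_suffix])
        # Remove the temporary columns created for this chunk.
        result = result.drop(columns=[c + tmp_suffix for c in overlap])
    return result


df = pd.DataFrame({"A": [20170801, 20170801, 20170802, 20170901],
                   "B": ["B0", "B1", "B2", "B3"],
                   "C": ["C0", "C1", "C2", "C3"],
                   "D": ["D0", "D1", "D2", "D3"]})
chunk1 = pd.DataFrame({"A": [20170801, 20170802, 4, 4],
                       "E": ["B4", "B5", "B6", "B7"],
                       "F": ["C4", "C5", "C6", "C7"],
                       "G": ["D4", "D5", "D6", "D7"]})
chunk2 = pd.DataFrame({"A": [20170901, 67, 68, 69],
                       "E": ["B4", "B5", "B6", "B7"],
                       "F": ["C4", "C5", "C6", "C7"],
                       "G": ["D4", "D5", "D6", "D7"]})

merged = merge_in_chunks(df, [chunk1, chunk2], on="A")
print(merged)
```

Because the helper only iterates over `chunks`, it should also accept the iterator returned by `pd.read_csv(..., chunksize=...)`, so the right-hand data never has to be held in memory all at once.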

【Comments】:

  • The apply loop is not needed. Vectorized numpy.where or pandas.Series.where would work, or even combine_first. What the OP needs resembles SQL's COALESCE. See: stackoverflow.com/questions/38152389/…
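
To illustrate the vectorized options named in this comment, here is a minimal sketch on a single column pair. The small frame is an assumption: it mimics columns E and E_x as they look right after chunk2 has been merged with suffixes=('', '_x'). All three variants are meant to produce the same coalesced column.

```python
import numpy as np
import pandas as pd

# Assumed state after merging chunk2, reduced to one column pair.
df = pd.DataFrame({"E": ["B4", "B4", "B5", np.nan],
                   "E_x": [np.nan, np.nan, np.nan, "B4"]})

# numpy.where: take E_x where E is missing, otherwise keep E.
via_np_where = pd.Series(np.where(df["E"].isna(), df["E_x"], df["E"]),
                         index=df.index)

# Series.where: keep E where the condition holds, else use E_x.
via_series_where = df["E"].where(df["E"].notna(), df["E_x"])

# combine_first: fill missing values of E from E_x.
via_combine_first = df["E"].combine_first(df["E_x"])

print(via_combine_first.tolist())
```

Any of these could replace the row-wise apply inside the loop over columns E, F, and G; combine_first is the shortest to write.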