Aggregate columns on a DataFrame, grouping it according to another DataFrame, without merging them

【Question Title】: Aggregate columns on a DataFrame, grouping it according to another DataFrame, without merging them
【Posted】: 2019-08-31 20:05:05
【Question】:

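I have two DataFrames, df1 and df2:

df1 has column1 and column2, and it has a lot of rows (~10 million)
df2 has column2 plus many other columns, and it is short (~100 columns and ~1000 rows)

What I want to achieve is: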

df1.merge(df2, on=column2).groupby(column1).agg($SomeAggregatingFunction)

but I want to avoid the merge operation, because it uses a lot of memory.

Is there any way to get this behavior?

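【Question Comments】:

Tags: python pandas pandas-groupby

【Solution 1】:

I expect this approach will likely be slower than the merge unless memory overhead is the bottleneck. Still, have you tried subsetting df2 using the column2 values of each group returned by a groupby on df1? See the example below for what I mean.

I suppose another option would be to consider a map-reduce framework (e.g. pyspark)?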

import numpy as np
import pandas as pd

# two toy datasets
df1 = pd.DataFrame({i:np.random.choice(np.arange(10), size=20) for i in range(2)}).rename(columns={0:'col1',1:'col2'})
df2 = pd.DataFrame({i:np.random.choice(np.arange(10), size=5) for i in range(2)}).rename(columns={0:'colOther',1:'col2'})

# make sure we don't use values of col2 that df2 doesn't contain
df1 = df1[df1['col2'].isin(df2['col2'])]

# for faster indexing and use of .loc
df2_col2_idx = df2.set_index('col2')

# iterate over the groups rather than merge
for i,group in df1.groupby('col1'):
    subset = df2_col2_idx.loc[group.col2,:]

    # some function on the subset here
    # note 'i' is the col1 index
    print(i,subset.colOther.mean())
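
To keep the per-group results instead of printing them, the same loop can collect them into a Series. This is only a minimal sketch: the toy values below are made up for illustration, and `mean` stands in for whatever aggregating function is actually needed.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
df1 = pd.DataFrame({"col1": [0, 0, 1, 1, 1], "col2": [10, 20, 10, 30, 30]})
df2 = pd.DataFrame({"col2": [10, 20, 30], "colOther": [1.0, 2.0, 6.0]})

# Index df2 by the join key so .loc can look rows up by col2 value.
df2_col2_idx = df2.set_index("col2")

# Aggregate one group at a time: only the df2 rows needed for the
# current group are materialized, never the full merged frame.
results = {}
for key, group in df1.groupby("col1"):
    subset = df2_col2_idx.loc[group["col2"], :]
    results[key] = subset["colOther"].mean()

# One aggregated value per col1 group, indexed by the group key.
loop_result = pd.Series(results, name="colOther")
print(loop_result)
```

The trade-off is a Python-level loop over the groups, so this is likely to be slower than a merge when df1 has many distinct col1 values.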

Update: incorporating @max's comment, which suggested using apply on the groups:

df1.groupby(column1).apply(lambda x: df2_col2_idx.loc[x[column2],other_columns].agg($SomeAggregatingFunction))
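
A runnable version of this pattern, with a check against the merge-based result, might look like the sketch below. The toy values, the `other_columns` list, and the choice of `mean` as the aggregation are all assumptions made for illustration; they are not part of the original question.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
df1 = pd.DataFrame({"column1": ["a", "a", "b", "b", "b"],
                    "column2": [10, 20, 10, 30, 30]})
df2 = pd.DataFrame({"column2": [10, 20, 30],
                    "x": [1.0, 2.0, 6.0],
                    "y": [10.0, 20.0, 60.0]})

other_columns = ["x", "y"]  # assumed: the df2 columns to aggregate
df2_col2_idx = df2.set_index("column2")

# For each column1 group, look up the matching df2 rows and aggregate them.
# Selecting [["column2"]] passes only the lookup column to the lambda.
no_merge = df1.groupby("column1")[["column2"]].apply(
    lambda g: df2_col2_idx.loc[g["column2"], other_columns].mean()
)

# Reference result using the merge that the question wants to avoid.
merged = df1.merge(df2, on="column2").groupby("column1")[other_columns].mean()

print(no_merge)
print(merged)
```

On this toy data both frames hold the same per-group means, so the apply-based version can be validated on a small sample before running it on the full df1.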

【Comments】:

Based on your reply, I have solved the problem with: df1.groupby(column1).apply(lambda x: df2_col2_idx.loc[x[column2],other_columns].agg($SomeAggregatingFunction))