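How to add a column to a dask dataframe that contains the mean of one column's values, grouped by matching values in another column: answers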

【Question title】: How to add a column in a dask dataframe that contains the mean of the values of one column based on the similarity of the values in other columns
【Posted】: 2019-06-15 10:36:29
【Question】:

I have a very large CSV file that I have loaded in Python as a dask dataframe. I made a small dataframe to explain my problem.

import dask.dataframe as dd
df = dd.read_csv("name and path of the file.csv")
df.head()

Output:

 +----+----+----+----+
 |col1|col2|col3|col4|
 +----+----+----+----+
 |  A |2001|  2 |  5 |
 +----+----+----+----+
 |  A |2001|  2 |  4 |
 +----+----+----+----+
 |  A |2001|  3 |  6 |
 +----+----+----+----+
 |  A |2002|  4 |  5 |
 +----+----+----+----+
 |  B |2001|  2 |  9 |
 +----+----+----+----+
 |  B |2001|  2 |  4 |
 +----+----+----+----+
 |  B |2001|  2 |  3 |
 +----+----+----+----+
 |  B |2001|  3 | 95 |
 +----+----+----+----+

I want to add another column, col3_mean, that holds the mean of col3 over all rows that share the same value in col1.

 +----+----+----+----+---------+
 |col1|col2|col3|col4|col3_mean|
 +----+----+----+----+---------+
 |  A |2001|  2 |  5 |   2.75  |
 +----+----+----+----+---------+
 |  A |2001|  2 |  4 |   2.75  |
 +----+----+----+----+---------+
 |  A |2001|  3 |  6 |   2.75  |
 +----+----+----+----+---------+
 |  A |2002|  4 |  5 |   2.75  |
 +----+----+----+----+---------+
 |  B |2001|  2 |  9 |   2.25  |
 +----+----+----+----+---------+
 |  B |2001|  2 |  4 |   2.25  |
 +----+----+----+----+---------+
 |  B |2001|  2 |  3 |   2.25  |
 +----+----+----+----+---------+
 |  B |2001|  3 | 95 |   2.25  |
 +----+----+----+----+---------+

I know that in pandas this can be done with:

df['col3_mean'] = df.groupby(['col1'])['col3'].transform('mean')

In dask I tried the following code, but it returns NaN values for col3_mean:

df['col3_mean'] = df.groupby(df.col1).col3.mean()

I also tried df['index'] = df.groupby(df.lable).col3.mean().collect(), but it does not work. In addition, the following line only returns a pandas.core.series.Series:

df.groupby(df.col1).col3.mean().collect()

【Question comments】:
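
The NaN values in the question are most likely an index-alignment effect: the grouped mean is indexed by the col1 labels, while the dataframe rows carry their own integer index, so assigning one to the other finds no matching labels. The sketch below reproduces this in plain pandas (not dask) on the col1 and col3 values from the example; the variable names are illustrative only.

```python
import pandas as pd

# Small frame mirroring col1 and col3 from the example in the question.
df = pd.DataFrame({
    "col1": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "col3": [2, 2, 3, 4, 2, 2, 2, 3],
})

# One mean per group; the result is indexed by the group labels "A" and "B".
group_means = df.groupby("col1")["col3"].mean()

# Direct assignment aligns on the index: the row labels 0..7 never match
# "A" or "B", so every value in the new column is NaN.
misaligned = df.assign(col3_mean=group_means)

# Looking up each row's col1 value in the group means gives the intended column.
aligned = df.assign(col3_mean=df["col1"].map(group_means))
```

Mapping col1 onto the group means, or merging on col1, sidesteps the alignment problem because the lookup goes through the column values rather than the row index.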

Tags: python dataframe mean aggregation dask

【Solution 1】:

After posting my question, I found an answer:

s = df.groupby(df.col1).col3.mean().compute()
# s is a pandas Series indexed by the col1 values
df['col3_mean'] = df['col1'].map(s)

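However, it does not work for my large dataframe. It runs forever and I have to restart my computer.

If you have another solution, please let me know.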

【Comments】:

【Solution 2】:

The following code works for my large dataset:

agg = df.groupby(['lable']).open_cap.aggregate(["mean"])
agg.columns = ['col3_mean']
df = df.merge(agg.reset_index(), on="lable", how="left")

If you have another answer to this question, please add it as well.

【Comments】: