[Posted]: 2019-06-15 10:36:29
[Question]:
I have a very large CSV file that I have loaded in Python as a dask dataframe. I made a small dataframe to illustrate my problem.
import dask.dataframe as dd
df = dd.read_csv("name and path of the file.csv")
df.head()
Output:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A |2001| 2 | 5 |
+----+----+----+----+
| A |2001| 2 | 4 |
+----+----+----+----+
| A |2001| 3 | 6 |
+----+----+----+----+
| A |2002| 4 | 5 |
+----+----+----+----+
| B |2001| 2 | 9 |
+----+----+----+----+
| B |2001| 2 | 4 |
+----+----+----+----+
| B |2001| 2 | 3 |
+----+----+----+----+
| B |2001| 3 | 95 |
+----+----+----+----+
I want to add another column, col3_mean, that holds the mean of col3 over all rows sharing the same value in col1.
+----+----+----+----+---------+
|col1|col2|col3|col4|col3_mean|
+----+----+----+----+---------+
| A |2001| 2 | 5 | 2.75 |
+----+----+----+----+---------+
| A |2001| 2 | 4 | 2.75 |
+----+----+----+----+---------+
| A |2001| 3 | 6 | 2.75 |
+----+----+----+----+---------+
| A |2002| 4 | 5 | 2.75 |
+----+----+----+----+---------+
| B |2001| 2 | 9 | 2.25 |
+----+----+----+----+---------+
| B |2001| 2 | 4 | 2.25 |
+----+----+----+----+---------+
| B |2001| 2 | 3 | 2.25 |
+----+----+----+----+---------+
| B |2001| 3 | 95 | 2.25 |
+----+----+----+----+---------+
I know that in pandas this can be done with:
df['col3_mean'] = df.groupby(['col1'])['col3'].transform('mean')
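As a small check of the pandas behaviour described above, here is a sketch that rebuilds the sample table by hand (the name `pdf` is only an illustrative choice, not from the original post) and applies the same `transform('mean')` call:

```python
import pandas as pd

# Sample data copied from the table in the question
pdf = pd.DataFrame({
    "col1": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "col2": [2001, 2001, 2001, 2002, 2001, 2001, 2001, 2001],
    "col3": [2, 2, 3, 4, 2, 2, 2, 3],
    "col4": [5, 4, 6, 5, 9, 4, 3, 95],
})

# transform('mean') returns one value per original row (aligned on the
# original index), so it can be assigned directly as a new column
pdf["col3_mean"] = pdf.groupby("col1")["col3"].transform("mean")
print(pdf)
```

Group A averages (2 + 2 + 3 + 4) / 4 = 2.75 and group B averages (2 + 2 + 2 + 3) / 4 = 2.25, matching the desired output table.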
I tried the following code in dask, but it returns NaN values for col3_mean:
df['col3_mean'] = df.groupby(df.col1).col3.mean()
I also tried df['index'] = df.groupby(df.lable).col3.mean().collect(), but it did not work. In addition, the following line only returns a pandas.core.series.Series:
df.groupby(df.col1).col3.mean().collect()
[Discussion]:
Tags: python dataframe mean aggregation dask