【Posted】: 2020-12-11 19:24:09
【Question】:
Generate a matrix with row totals in a new column and column totals in a new row of a PySpark DataFrame
colors = spark.createDataFrame([("Red","Re",20),("Blue","Bl",30),("Green","Gr",50)]).toDF("Colors","Prefix","Value")
+------+------+-----+
|Colors|Prefix|Value|
+------+------+-----+
|   Red|    Re|   20|
|  Blue|    Bl|   30|
| Green|    Gr|   50|
+------+------+-----+
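# Pivot Prefix values into columns; combinations with no data become 0
piv = colors.groupby("Colors").pivot("Prefix").sum("Value").fillna(0)
# Row totals: Python's built-in sum adds the pivoted Column objects together
piv.withColumn("total",sum(piv[col] for col in piv.columns[1:])).show()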
+------+---+---+---+-----+
|Colors| Bl| Gr| Re|total|
+------+---+---+---+-----+
| Green|  0| 50|  0|   50|
|  Blue| 30|  0|  0|   30|
|   Red|  0|  0| 20|   20|
+------+---+---+---+-----+
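Expected output, with the column totals added as a final row (the code should be dynamic, since the real data has more columns and rows):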
        Re   Bl   Gr   TOTAL
Red     20    0    0      20
Blue     0   30    0      30
Green    0    0   50      50
TOTAL   20   30   50     100
【Discussion】:
Tags: python dataframe apache-spark pyspark