Combining multiple rows in Spark

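【Question title】: Combine multiple rows in Spark
【Posted】: 2020-02-07 15:20:22
【Question】:

I would like to know whether there is a simple way to combine multiple rows into one in PySpark. I am new to Python and Spark, and so far I have mostly been using spark.sql.

Here is a sample of the data: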

id   count1   count2   count3
1    null     1        null
1    3        null     null
1    null     null     5
2    null     1        null
2    1        null     null
2    null     null     2

The expected output is:

id   count1   count2   count3
1    3        1        5
2    1        1        2

I currently get this result by joining the data several times in Spark SQL, and I am wondering whether there is an easier way.

Thanks!

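【Question comments】:

I'm not sure whether this is intentional, but in your data it looks like each column has only one non-null value per id?
If there is only one non-null value per id, you can use groupBy + first with ignorenulls=True. For example: df.groupBy('id').agg(*[first(c, True).alias(c) for c in df.columns[1:]])
Or groupBy with max: df.groupBy("id").agg(*[max(c).alias(c) for c in df.columns[1:]]).show()...
Yes, there is only one non-null value. Thanks, everyone, I'll give it a try!

Tags: apache-spark pyspark pyspark-sql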

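【Solution 1】:

Spark SQL's sum skips null values, so if you know there are no "overlapping" data elements (at most one non-null value per id in each column), you can simply group by the column you want to aggregate on and sum the rest. Note that a group whose values are all null sums to null, not zero.

Assuming you want to keep the original column names (and not sum the id column), you need to specify which columns to sum and then rename them after the aggregation.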

before.show()
+---+------+------+------+
| id|count1|count2|count3|
+---+------+------+------+
|  1|  null|     1|  null|
|  1|     3|  null|  null|
|  1|  null|  null|     5|
|  2|  null|     1|  null|
|  2|     1|  null|  null|
|  2|  null|  null|     2|
+---+------+------+------+

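from pyspark.sql.functions import col

# Sum every column except the grouping key
value_cols = [c for c in before.columns if c != 'id']

after = (
    before
    .groupby('id').sum(*value_cols)
    # groupby().sum() names its outputs "sum(count1)", ...; alias them back and keep id
    .select('id', *[col(f"sum({c})").alias(c) for c in value_cols])
    .orderBy('id')  # sorted only so the displayed row order is stable
)

after.show()
+---+------+------+------+
| id|count1|count2|count3|
+---+------+------+------+
|  1|     3|     1|     5|
|  2|     1|     1|     2|
+---+------+------+------+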

【Comments】: