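For uniqueConcatenate, you can use the collect_set() function to aggregate each column into a set of its distinct values, then join that set into a string with concat_ws().
For example: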
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{collect_set, concat_ws}
import spark.implicits._ // assumes a SparkSession named `spark` is in scope, as in spark-shell

case class Record(col1: Option[Int] = None, col2: Option[Int] = None, col3: Option[Int] = None)

val df: DataFrame = Seq(
  Record(Some(1), Some(1), Some(1)),
  Record(Some(1), None, Some(3)),
  Record(Some(1), Some(3), Some(3))
).toDF("col1", "col2", "col3")

df.show()
/*
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   1|   1|
|   1|null|   3|
|   1|   3|   3|
+----+----+----+
*/
// collect_set drops nulls and duplicates; the order of the collected values is not guaranteed.
df.agg(
  concat_ws(". ", collect_set("col1")).as("col1"),
  concat_ws(". ", collect_set("col2")).as("col2"),
  concat_ws(". ", collect_set("col3")).as("col3")
).show()
/*
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|1. 3|1. 3|
+----+----+----+
*/
For uniqueCount, you can use countDistinct in a similar way:
import org.apache.spark.sql.functions.countDistinct
// countDistinct ignores nulls, so col2 (1, null, 3) counts as 2.
df.agg(
  countDistinct("col1").as("col1"),
  countDistinct("col2").as("col2"),
  countDistinct("col3").as("col3")
).show()
/*
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|   2|
+----+----+----+
*/
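
If you have many columns, you may not want to write one expression per column by hand. The following is only a sketch of how both aggregations could be built from df.columns. The helper names uniqueConcatenate and uniqueCount and the sep parameter are hypothetical, chosen here to match the names in the question. The explicit cast to array<string> is also my addition, so that concat_ws does not rely on implicit type coercion.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_set, concat_ws, countDistinct}

// Hypothetical helper: for every column, join its distinct non-null values into one string.
// Assumes df has at least one column (aggs.head throws on an empty list).
def uniqueConcatenate(df: DataFrame, sep: String = ". "): DataFrame = {
  val aggs: Seq[Column] = df.columns.toSeq.map { c =>
    // Cast the collected set to array<string> explicitly before joining it.
    concat_ws(sep, collect_set(col(c)).cast("array<string>")).as(c)
  }
  df.agg(aggs.head, aggs.tail: _*)
}

// Hypothetical helper: for every column, count its distinct non-null values.
def uniqueCount(df: DataFrame): DataFrame = {
  val aggs: Seq[Column] = df.columns.toSeq.map(c => countDistinct(col(c)).as(c))
  df.agg(aggs.head, aggs.tail: _*)
}
```

With the df defined above, uniqueConcatenate(df).show() and uniqueCount(df).show() should give the same results as the hand-written versions, except that the order of values inside each concatenated string may differ between runs.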