Expression 'id' is neither present in the group by, nor is it an aggregate function

【Question Title】: Expression 'id' is neither present in the group by, nor is it an aggregate function
【Posted】: 2019-01-10 21:56:56
【Question】:

Using Scala and Spark 1.6.3, I get the following error message:

org.apache.spark.sql.AnalysisException: expression 'id' is neither present in the group by, nor is it an aggregate function. 
Add to group by or wrap in first() (or first_value) if you don't care which value you get.;

The code that produces the error is:

returnDf.withColumn("colName", max(col("otherCol")))

The DataFrame returnDf looks like this:

+---+--------------------+
| id|            otherCol|
+---+--------------------+
|1.0|[0.0, 0.217764172...|
|2.0|          [0.0, 0.0]|
|3.0|[0.0, 0.142646382...|
|4.0|[0.63245553203367...|
+---+--------------------+

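There is a solution when using SQL syntax. What is the equivalent solution using the syntax I used above (i.e. the withColumn() function)?

【Comments】:

So you are actually looking for the maximum value within each array, aren't you? If so, you cannot use max at all (not that it could be applied to an array<> column anyway). In 2.4 you can use higher-order functions, but in 1.6 you have to use a udf, such as udf((xs: Seq[Double]) => xs.max).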

Tags: scala apache-spark apache-spark-sql

【Solution 1】:

You need to call groupBy before using an aggregate function: returnDf.groupBy(col("id")).agg(max("otherCol"))
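
To illustrate the groupBy-then-aggregate pattern in isolation, here is a minimal sketch. It rests on assumptions that are not in the question: the `scores` DataFrame, its scalar `score` column, the app name, and the local master are all made up for illustration. A scalar column is used deliberately, because this pattern computes a maximum across rows and does not look inside an array.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, max}

// Local Spark 1.6-style setup; app name and master are illustrative.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("groupby-sketch").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical data: a scalar `score` column, not the array column from the question.
val scores = Seq((1.0, 0.2), (1.0, 0.7), (2.0, 0.5)).toDF("id", "score")

// One output row per id, holding the largest score within that group.
val maxPerId = scores.groupBy(col("id")).agg(max(col("score")).as("maxScore"))

// No groupBy: the whole DataFrame is treated as one group, so the result has one row.
val overallMax = scores.agg(max(col("score")).as("maxScore"))

maxPerId.show()
overallMax.show()
```

Either way the aggregate collapses rows, which is why it cannot be used inside withColumn, where every input row must produce exactly one output row.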

【Comments】:

To make your answer easier for others to read, please try to use code formatting for code snippets. See stackoverflow.com/editing-help#code for details.

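【Solution 2】:

The problem is that max is an aggregate function: it returns the maximum value of a column across rows, not the maximum of the array stored in each row of that column.

To get the maximum of each array, the correct solution is to use a UDF: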

returnDf.withColumn("colName", udf((v : Seq[Double]) => v.max).apply(col("otherCol")))
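
For anyone who wants to try the one-liner above end to end, the following is a small sketch with a local Spark 1.6-style setup. The sample rows, app name, and local master are made up; only the column names `id`, `otherCol`, and `colName` come from the question. It assumes every array is non-null and non-empty, since `Seq.max` throws on an empty sequence.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, udf}

// Local Spark 1.6-style setup; app name and master are illustrative.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("array-max-sketch").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Made-up rows shaped like returnDf: a Double id and an array<double> column.
val returnDf = Seq(
  (1.0, Seq(0.0, 0.25)),
  (2.0, Seq(0.0, 0.0)),
  (3.0, Seq(0.5, 0.125))
).toDF("id", "otherCol")

// Row-level function: the largest element of a single array.
// Assumption: arrays are never null or empty; Seq.max throws on an empty Seq.
val arrayMax = udf((xs: Seq[Double]) => xs.max)

// withColumn keeps one output row per input row, which is what a UDF provides.
val withMax = returnDf.withColumn("colName", arrayMax(col("otherCol")))
withMax.show()
```

Defining the UDF once as a value (`arrayMax`) instead of inline makes it reusable across several columns; the behavior is the same as calling `.apply` on an inline `udf(...)`.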

【Comments】: