【发布时间】:2019-04-20 14:49:52
【问题描述】:
我正在使用一些在 scala 中返回 true/false 的业务逻辑在 spark 数据框中添加一列。该实现是使用 UDF 完成的,并且 UDF 有超过 10 个参数,因此我们需要先注册 UDF,然后才能使用它。以下已完成
spark.udf.register("new_col", new_col)
// writing the UDF
val new_col(String, String, ..., Timestamp) => Boolean = (col1: String, col2: String, ..., col12: Timestamp) => {
if ( ... ) true
else false
}
现在,当我尝试编写以下 spark/Scala 作业时,它无法正常工作
val result = df.withColumn("new_col", new_col(col1, col2, ..., col12))
我收到以下错误
<console>:56: error: overloaded method value udf with alternatives:
(f: AnyRef,dataType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF10[_, _, _, _, _, _, _, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF9[_, _, _, _, _, _, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF8[_, _, _, _, _, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF7[_, _, _, _, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF6[_, _, _, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF5[_, _, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF4[_, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF3[_, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF2[_, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF1[_, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF0[_],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and> ...
另一方面,如果我创建一个临时视图并使用 spark.sql,它可以像下面这样完美地工作
df.createOrReplaceTempView("data")
val result = spark.sql(
s"""
SELECT *, new_col(col1, col2, ..., col12) AS new_col FROM data
"""
)
我错过了什么吗?在 spark/scala 中进行此类查询的方法是什么?
【问题讨论】:
标签: scala apache-spark apache-spark-sql