Apply a udf to multiple columns and use numpy operations

【Question title】: apply udf to multiple columns and use numpy operations
【Posted】: 2020-01-29 16:54:18
【Question description】:

I have a dataframe named result in pyspark, and I want to apply a udf to create a new column, as follows:

result = sqlContext.createDataFrame([(138,5,10), (128,4,10), (112,3,10), (120,3,10), (189,1,10)]).withColumnRenamed("_1","count").withColumnRenamed("_2","df").withColumnRenamed("_3","docs")
@udf("float")
def newFunction(arr):
    return (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])

result=result.withColumn("new_function_result",newFunction_udf(array("count","df","docs")))

The columns count, df, and docs are all integer columns. But this returns

Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

When I pass a single column and compute its square, it works fine.

Any help is appreciated.

【Question discussion】:

Please give us a reproducible example and show us the full error message.
@cronoik Edited.
Sorry, your createDataframe function raises an error. Shouldn't it be sqlContext.createDataFrame([(138,5,10), (128,4,10), (112,3,10), (120,3,10), (189,1,10)])?
Updated. Sorry for the inconvenience.

Tags: python numpy apache-spark pyspark apache-spark-sql

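【Solution 1】:

The error message is misleading, but it is trying to tell you that your function does not return a float. Your function returns a value of type numpy.float64, which you can fetch with the VectorUDT type (function: newFunctionVector in the example below). Another way to use numpy is to cast the numpy type numpy.float64 to the Python type float (function: newFunctionWithArray in the example below).

Last but not least, there is no need to call array, because udfs can take multiple parameters (function: newFunction in the example below).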

import numpy as np
from pyspark.sql.functions import udf, array
from pyspark.sql.types import FloatType
from pyspark.mllib.linalg import Vectors, VectorUDT

result = sqlContext.createDataFrame([(138,5,10), (128,4,10), (112,3,10), (120,3,10), (189,1,10)], ["count","df","docs"])

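# Returns numpy.float64; wrapped further down with VectorUDT as the return type.
def newFunctionVector(arr):
    return (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])

# Takes a single array column; .item() converts numpy.float64 to a Python float.
@udf("float")
def newFunctionWithArray(arr):
    returnValue = (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])
    return returnValue.item()

# Takes the three columns as separate arguments, so array() is not needed.
@udf("float")
def newFunction(count, df, docs):
    returnValue = (1 + np.log(count)) * np.log(docs/df)
    return returnValue.item()

vector_udf = udf(newFunctionVector, VectorUDT())

result = result.withColumn("new_function_result", newFunction("count", "df", "docs"))

result = result.withColumn("new_function_result_WithArray", newFunctionWithArray(array("count", "df", "docs")))

# Note: this line calls newFunctionWithArray rather than vector_udf,
# which is why the schema below reports this column as float.
result = result.withColumn("new_function_result_Vector", newFunctionWithArray(array("count", "df", "docs")))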

result.printSchema()

result.show()

Output:

root 
|-- count: long (nullable = true) 
|-- df: long (nullable = true) 
|-- docs: long (nullable = true) 
|-- new_function_result: float (nullable = true) 
|-- new_function_result_WithArray: float (nullable = true) 
|-- new_function_result_Vector: float (nullable = true)

+-----+---+----+-------------------+-----------------------------+--------------------------+ 
|count| df|docs|new_function_result|new_function_result_WithArray|new_function_result_Vector|
+-----+---+----+-------------------+-----------------------------+--------------------------+ 
|  138|  5|  10|           4.108459|                     4.108459|                  4.108459| 
|  128|  4|  10|           5.362161|                     5.362161|                  5.362161|
|  112|  3|  10|          6.8849173|                    6.8849173|                 6.8849173|
|  120|  3|  10|           6.967983|                     6.967983|                  6.967983|
|  189|  1|  10|          14.372153|                    14.372153|                 14.372153|  
+-----+---+----+-------------------+-----------------------------+--------------------------+
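
As a side note, the conversion step can be avoided altogether when the inputs are plain scalars. The sketch below is not part of the original answer: it assumes the three column values reach the function as ordinary Python integers, and the name log_weight is hypothetical. It uses math.log from the standard library, which already returns a Python float, so no .item() call is needed.

```python
import math

def log_weight(count, df, docs):
    """Same formula as newFunction, using math.log instead of np.log.

    math.log returns a built-in float, so the result can be returned
    directly without converting from numpy.float64.
    """
    return (1 + math.log(count)) * math.log(docs / df)

# Rows taken from the example dataframe above.
rows = [(138, 5, 10), (128, 4, 10), (112, 3, 10), (120, 3, 10), (189, 1, 10)]
for count, df, docs in rows:
    value = log_weight(count, df, docs)
    print(count, df, docs, value, type(value).__name__)
```

Under the same assumption, this function could presumably be decorated with @udf("float") in the same way as newFunction and called with the three column names. It only works on scalars, though; vectorised numpy operations would still need the conversion shown in the answer.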

【Discussion】: