Python to PySpark Functions UDF: how to output a list of lists

【Question title】: Python to Pyspark Functions UDF how to output a list of lists
【Posted】: 2021-11-13 16:33:16
【Question】:

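I have a function in Python (many different functions, actually, but they all follow the same pattern) that I am converting to PySpark. The function takes a list of integers as input and returns a list that contains n lists of integers. An example: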

# I know some of these imports are not necessary right now
import pyspark
from pyspark import SQLContext
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, StringType, ArrayType
from pyspark.sql.functions import udf
from pyspark.sql import Row
from pyspark.sql import functions as F

# Example call and the result it should produce
my_function([4, 5, 7, 8, 10, 11])
# [[4, 5], [7, 8], [10, 11]]
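The body of my_function is not shown in the question. For concreteness, here is a minimal plain-Python sketch of a function with the same input and output shape. It assumes the function groups runs of consecutive integers; both the name group_consecutive and that grouping rule are guesses made for illustration only.

```python
from typing import List


def group_consecutive(values: List[int]) -> List[List[int]]:
    """Split a sorted list of ints into runs of consecutive values.

    Hypothetical stand-in for the question's my_function.
    """
    groups: List[List[int]] = []
    for v in values:
        # Extend the current run if v directly follows its last element.
        if groups and v == groups[-1][-1] + 1:
            groups[-1].append(v)
        else:
            # Otherwise start a new run.
            groups.append([v])
    return groups


print(group_consecutive([4, 5, 7, 8, 10, 11]))  # [[4, 5], [7, 8], [10, 11]]
```

The point is the return shape: a list whose elements are themselves lists of ints, which is what the UDF return type has to describe.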

This is what I am trying, but I get an error when I try to use it:

pyspark_my_function = udf(my_function, ArrayType(IntegerType()))

TypeError: Invalid argument, not a string or column: [4, 5, 6, 8, 9] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

I also have some other functions that return 2 or 3 outputs, each of which is also a list of lists. How can I convert those? This is what I tried:

schema = StructType([StructField("output1", ArrayType(IntegerType()), nullable=False), 
                     StructField("output2", ArrayType(IntegerType()), nullable=False)])

pyspark_function = udf(my_function, schema)

Thank you all!

【Question comments】:

Tags: python apache-spark pyspark apache-spark-sql user-defined-functions

【Solution 1】:

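Both the input and the output of your UDF appear to be the problem: a UDF has to be called with a column (or column name) rather than a Python list, and a list of lists needs a nested ArrayType as its return type. Take a look at the sample code below.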

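from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Reuse the active session if one exists (e.g. in a shell or notebook), otherwise create one
spark = SparkSession.builder.getOrCreate()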

df = (spark
    .sparkContext
    .parallelize([
        ([4,5,7,8,9],),
    ])
    .toDF(['A'])
 )

def myfunc(a):
    # Placeholder body: always returns the same list of lists, whatever `a` is
    return [[4, 5], [7, 8], [10, 11]]

(df
    .withColumn('test', F.udf(myfunc, T.ArrayType(T.ArrayType(T.IntegerType())))('A'))
    .show()
)

# Output
# +---------------+--------------------+
# |              A|                test|
# +---------------+--------------------+
# |[4, 5, 7, 8, 9]|[[4, 5], [7, 8], ...|
# +---------------+--------------------+

【Comments】: