[Posted]: 2021-06-25 16:55:31
[Question]:
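I am trying to get the maximum value across a list of columns, together with the name of the column that holds that maximum, as described in these posts: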
PySpark: compute row maximum of the subset of columns and add to an exisiting dataframe
how to get the name of column with maximum value in pyspark dataframe
I have looked through many posts and tried several options, but have not succeeded yet.
TypeError: 'Column' object is not callable using WithColumn
Pyspark: Pass multiple columns in UDF
Columns in the table loaded into the dataframe: Rule_Total_Score: double, Rule_No_Identifier_Score: double
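from pyspark.sql import functions as f
from pyspark.sql.types import DoubleType

rules = ['Rule_Total_Score', 'Rule_No_Identifier_Score']
df = spark.sql('select * from table')

@f.udf(DoubleType())
def get_max_row_with_None(*cols):
    return float(max(x for x in cols if x is not None))

# Note: f.struct(...) reaches the UDF as one Row argument, not as separate values
sdf = df.withColumn("max_rule", get_max_row_with_None(f.struct([df[col] for col in df.columns if col in rules])))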
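For reference, here is a minimal sketch of the behaviour being asked for, not a confirmed fix. `row_max_with_name` is a hypothetical plain-Python model of the per-row logic (ignore None, return the maximum and the column it came from). `add_max_columns` is a hypothetical helper that expresses the same idea with the Spark built-ins `greatest`, `when`, and `coalesce` instead of a UDF. The output column name `max_rule_name` is an assumption, and the Spark version assumes `rules` contains at least two columns.

```python
from typing import Optional, Sequence, Tuple


def row_max_with_name(
    names: Sequence[str], values: Sequence[Optional[float]]
) -> Tuple[Optional[float], Optional[str]]:
    """Plain-Python model of the per-row logic: skip None, return (max, column name)."""
    present = [(v, n) for n, v in zip(names, values) if v is not None]
    if not present:
        # Every value is missing, so there is no maximum to report
        return None, None
    # On ties, max() keeps the first pair, i.e. the earliest column in `names`
    best_value, best_name = max(present, key=lambda pair: pair[0])
    return float(best_value), best_name


def add_max_columns(df, rules):
    """Same idea with Spark built-ins; assumes `rules` has at least two columns."""
    # Imported lazily so the plain-Python helper above can be used without Spark
    from pyspark.sql import functions as f

    # greatest() skips nulls and yields null only when every input is null
    max_value = f.greatest(*[f.col(c) for c in rules])
    # when() without otherwise() yields null on no match; coalesce() keeps the first match
    max_name = f.coalesce(*[f.when(f.col(c) == max_value, f.lit(c)) for c in rules])
    return df.withColumn("max_rule", max_value).withColumn("max_rule_name", max_name)
```

With the dataframe from the question, the call would be `sdf = add_max_columns(df, rules)`. Staying with built-in column functions avoids the Python UDF round trip, and in both versions a tie resolves to the earlier column in `rules`.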
[Discussion]:
Tags: apache-spark pyspark apache-spark-sql user-defined-functions