Assuming your input dataframe looks like this:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|3   |7   |21  |9   |
|5   |15  |10  |2   |
+----+----+----+----+
Then you can write a udf function to get the output column you want:
from pyspark.sql import functions as f
from pyspark.sql import types as t

# Rank the values in ascending order (1 = smallest), then return (value, rank) pairs, largest value first.
def sortAndIndex(values):
    return sorted([(value, index+1) for index, value in enumerate(sorted(values))], reverse=True)

sortAndIndexUdf = f.udf(sortAndIndex, t.ArrayType(t.StructType([t.StructField('key', t.IntegerType(), True), t.StructField('value', t.IntegerType(), True)])))

df.withColumn('sortedAndIndexed', sortAndIndexUdf(f.array([x for x in df.columns]))).show(truncate=False)
This should give you:
+----+----+----+----+----------------------------------+
|col1|col2|col3|col4|sortedAndIndexed                  |
+----+----+----+----+----------------------------------+
|3   |7   |21  |9   |[[21, 4], [9, 3], [7, 2], [3, 1]] |
|5   |15  |10  |2   |[[15, 4], [10, 3], [5, 2], [2, 1]]|
+----+----+----+----+----------------------------------+
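If you want to check the ranking logic without a Spark session, here is a minimal plain-Python sketch of what the udf does to each row. The name `sort_and_index` and the `rows` list are only illustrative stand-ins for the udf body and the two sample rows; nothing here touches Spark itself.

```python
def sort_and_index(values):
    # Rank the values in ascending order, so the smallest value gets rank 1.
    ranked = [(value, rank) for rank, value in enumerate(sorted(values), start=1)]
    # Return the (value, rank) pairs with the largest value first.
    return sorted(ranked, reverse=True)

# Hypothetical stand-in for the two rows of the sample dataframe.
rows = [[3, 7, 21, 9], [5, 15, 10, 2]]
results = [sort_and_index(row) for row in rows]

for row, result in zip(rows, results):
    print(row, "->", result)
```

Note that the second element of each pair is the rank in ascending order, not the original column position, so 21 is paired with 4 because it is the largest of the four values.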
Update
You commented:
My calculation should be sum(value/index), so maybe using your udf function I should return some kind of reduce(add,)?
You can do it like this:
from pyspark.sql import functions as f
from pyspark.sql import types as t

# Sort the values ascending, divide each by its 1-based rank, and sum the quotients.
def divideAndSum(values):
    return sum([float(value)/(index+1) for index, value in enumerate(sorted(values))])

divideAndSumUdf = f.udf(divideAndSum, t.DoubleType())

df.withColumn('divideAndSum', divideAndSumUdf(f.array([x for x in df.columns]))).show(truncate=False)
This should give you:
+----+----+----+----+------------------+
|col1|col2|col3|col4|divideAndSum      |
+----+----+----+----+------------------+
|3   |7   |21  |9   |14.75             |
|5   |15  |10  |2   |11.583333333333334|
+----+----+----+----+------------------+
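To see where those numbers come from, here is a small plain-Python sketch of the same arithmetic, again outside Spark. `divide_and_sum` and `rows` are hypothetical names used only for this check. For the first row the sorted values are 3, 7, 9, 21, so the sum is 3/1 + 7/2 + 9/3 + 21/4 = 14.75.

```python
def divide_and_sum(values):
    # Sort ascending, divide each value by its 1-based rank, and add the quotients.
    return sum(float(value) / rank for rank, value in enumerate(sorted(values), start=1))

# Hypothetical stand-in for the two rows of the sample dataframe.
rows = [[3, 7, 21, 9], [5, 15, 10, 2]]
totals = [divide_and_sum(row) for row in rows]

for row, total in zip(rows, totals):
    print(row, "->", total)
```

Because the values are sorted ascending before dividing, the largest value is divided by the largest rank. If you need the opposite weighting, sorting with `reverse=True` inside the function would be the place to change it.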