【Posted】: 2017-07-19 11:23:01
【Question】:
I have a data frame that looks like this:
| id| age| rbc| bgr| dm|cad|appet| pe|ane|classification|
+---+----+------+-----+---+---+-----+---+---+--------------+
| 3|48.0|normal|117.0| no| no| poor|yes|yes| ckd|
....
....
....
I wrote a UDF to convert the categorical values yes, no, poor, normal into binary 0s and 1s:
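import org.apache.spark.sql.functions.{col, udf}

// Maps a categorical string to 0 or 1.
// Note: a value that is not listed here (e.g. "poor") throws scala.MatchError.
def stringToBinary(stringValue: String): Int = {
  stringValue match {
    case "yes" => 1
    case "no" => 0
    case "present" => 1
    case "notpresent" => 0
    case "normal" => 1
    case "abnormal" => 0
  }
}
// Wrap the plain Scala function as a Spark UDF
val stringToBinaryUDF = udf(stringToBinary _)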
I apply it to the data frame as follows:
val newCol = stringToBinaryUDF.apply(col("pc")) //creates the new column with formatted value
val refined1 = noZeroDF.withColumn("dm", newCol) //adds the new column to original
How can I pass multiple columns to the UDF so that I don't have to repeat myself for the other categorical columns?
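
One possible approach, sketched here rather than taken from the original post: keep the UDF single-argument and fold over a list of column names, replacing each column in turn. The object name `BinarizeColumns`, the helper `binarize`, and the choice to return `Option[Int]` (so that unlisted values such as "poor" become null instead of raising a `MatchError`) are all assumptions made for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper object; the names are illustrative, not from the question.
object BinarizeColumns {

  // Same mapping as in the question. Unlisted values (including null)
  // map to None, which Spark stores as null. This is an assumption;
  // the question does not say how such values should be handled.
  def stringToBinary(value: String): Option[Int] = value match {
    case "yes" | "present" | "normal"     => Some(1)
    case "no" | "notpresent" | "abnormal" => Some(0)
    case _                                => None
  }

  val stringToBinaryUDF = udf(stringToBinary _)

  // Applies the UDF to every named column, overwriting each one in place.
  def binarize(df: DataFrame, columns: Seq[String]): DataFrame =
    columns.foldLeft(df) { (acc, name) =>
      acc.withColumn(name, stringToBinaryUDF(col(name)))
    }
}
```

With the data frame from the question, usage might look like `val refined = BinarizeColumns.binarize(noZeroDF, Seq("rbc", "dm", "cad", "pe", "ane"))`. The fold keeps the UDF itself unchanged and only moves the repetition into the column list.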
【Discussion】:
Tags: scala apache-spark user-defined-functions