How to round decimals in Scala Spark

【Question title】: How to round decimal in Scala Spark
【Posted】: 2018-12-26 04:56:03
【Question description】:

I have a Scala Spark DataFrame (about 1 million rows) containing the following data:

id,score
1,0.956
2,0.977
3,0.855
4,0.866
...

How can I discretize/round the scores to the nearest 0.05?

Expected result:

id,score
1,0.95
2,1.00
3,0.85
4,0.85
...

I would like to avoid using a UDF, to maximize performance.

【Question discussion】:

Tags: scala apache-spark dataframe concurrency

【Solution 1】:

The answer can be simpler:

import org.apache.spark.sql.functions.{col, round}
dataframe.withColumn("rounded_score", round(col("score"), 2))

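There is a function in org.apache.spark.sql.functions:

def round(e: Column, scale: Int): Column

It rounds the value of e to scale decimal places using the HALF_UP rounding mode.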
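
To make the HALF_UP behaviour concrete, here is a minimal plain-Scala sketch (no Spark involved) that rounds the sample scores to two decimal places with scala.math.BigDecimal. The object HalfUpDemo and the helper halfUp2 are hypothetical names used only for illustration, and this is an analogy for what rounding to scale 2 means, not Spark's internal code.

```scala
// Plain-Scala illustration of HALF_UP rounding to 2 decimal places.
// HalfUpDemo and halfUp2 are hypothetical names, not part of Spark.
object HalfUpDemo {
  // Parse the decimal string exactly, then round to scale 2 with HALF_UP.
  def halfUp2(value: String): BigDecimal =
    BigDecimal(value).setScale(2, BigDecimal.RoundingMode.HALF_UP)

  def main(args: Array[String]): Unit = {
    // Sample scores from the question.
    Seq("0.956", "0.977", "0.855", "0.866").foreach { s =>
      println(s"$s -> ${halfUp2(s)}")
    }
  }
}
```

With this kind of rounding, 0.956 becomes 0.96 rather than the 0.95 the question asks for, because the value is rounded to the nearest 0.01 and not to the nearest 0.05.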

【Discussion】:

I don't think this really answers the question, since the solution is supposed to round to the nearest 0.05.

【Solution 2】:

You can do this with Spark's built-in functions:

dataframe.withColumn("rounded_score", round(col("score") * 100 / 5) * 5 / 100)

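Multiply by 100 so that the precision you want (hundredths) becomes a whole number.
Then divide that number by 5 and round it, which gives the number of 0.05 steps.
Multiply by 5 to get back to hundredths; the result is now a multiple of 5.
Divide by 100 to restore the original scale.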
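
The four steps above can be traced on a single value with a short plain-Scala sketch. It only mirrors the arithmetic; it is not the Spark column expression and is not meant to be wrapped in a UDF. The names RoundStepDemo and roundToNearest005 are hypothetical, and math.round stands in for Spark's round here, which is an assumption that holds for the sample scores because none of them lands exactly on a tie.

```scala
// Plain-Scala walk-through of the multiply / divide / round / rescale steps.
// RoundStepDemo and roundToNearest005 are hypothetical names for illustration.
object RoundStepDemo {
  def roundToNearest005(score: Double): Double = {
    val hundredths = score * 100            // e.g. 0.956 becomes about 95.6
    val steps = math.round(hundredths / 5)  // about 19.12 rounds to 19 steps of 0.05
    steps * 5 / 100.0                       // 19 * 5 = 95 hundredths, i.e. 0.95
  }

  def main(args: Array[String]): Unit = {
    // Sample scores from the question.
    Seq(0.956, 0.977, 0.855, 0.866).foreach { s =>
      println(f"$s%.3f -> ${roundToNearest005(s)}%.2f")
    }
  }
}
```

Doing the final division on a whole number of hundredths, rather than multiplying the step count by 0.05 directly, keeps the last operation a single division of an integer value by 100.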

Result:

+---+-----+-------------+
| id|score|rounded_score|
+---+-----+-------------+
|  1|0.956|         0.95|
|  2|0.977|          1.0|
|  3|0.855|         0.85|
|  4|0.866|         0.85|
+---+-----+-------------+

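【Discussion】:

【Solution 3】:

You can specify your schema when loading the data into a DataFrame.

Example:

Use DecimalType(10, 2) for the score column in a custom schema when loading the data.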

id,score
1,0.956
2,0.977
3,0.855
4,0.866
...
import org.apache.spark.sql.types._

val mySchema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("score", DecimalType(10, 2), true)
))

spark.read.format("csv").schema(mySchema).
  option("header", "true").option("nullvalue", "?").
  load("/path/to/csvfile").show

【Discussion】: