【Question Title】: How to find quantiles inside agg() function after groupBy in Scala Spark
【Posted】: 2019-09-03 07:06:47
【Question Description】:

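I have a DataFrame that I want to group by column A and then compute several statistics per group: mean, min, max, standard deviation and quantiles.

I can get the min, max and mean with the following code: df.groupBy("A").agg(min("B"), max("B"), mean("B")).show(50, false)

But I cannot find a way to get the quantiles (0.25, 0.5, 0.75). I tried approxQuantile and percentile, but that gives the following error:

error: not found: value approxQuantile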

【Comments】:

  • I assume you are trying to get some sample data from the DataFrame/Dataset. Spark has a sample(fraction: Double) API for that. Please try it.

Tags: scala apache-spark group-by aggregate quantile


【Solution 1】:

If you have Hive on the classpath, you can use many UDAFs such as percentile_approx and stddev_samp; see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)

You can call these functions with callUDF:

import ss.implicits._  // ss is the active SparkSession
import org.apache.spark.sql.functions.{callUDF, lit}

val df = Seq(1.0, 2.0, 3.0).toDF("x")

// groupBy() with no columns aggregates over the whole DataFrame
df.groupBy()
  .agg(
    callUDF("percentile_approx", $"x", lit(0.5)).as("median"),
    callUDF("stddev_samp", $"x").as("stdev")
  )
  .show()
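
For the grouped case in the question, one possible sketch passes an array of percentages to percentile_approx through expr, next to the other aggregates. The sample data, the local SparkSession setup and the output column aliases are assumptions made for illustration; the column names A and B come from the question. It is written as spark-shell style statements.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{expr, max, mean, min, stddev_samp}

// Local session for illustration only; in spark-shell a session already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("grouped-quantiles")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data using the column names from the question.
val df = Seq(("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 10.0), ("b", 20.0))
  .toDF("A", "B")

val stats = df.groupBy("A").agg(
  min("B").as("min"),
  max("B").as("max"),
  mean("B").as("mean"),
  stddev_samp("B").as("stddev"),
  // An array of percentages yields an array column, one value per percentage.
  expr("percentile_approx(B, array(0.25, 0.5, 0.75))").as("quartiles")
)

stats.show(false)
```

As for the error in the question: approxQuantile is a method on df.stat (DataFrameStatFunctions) that returns a local Array[Double] for the whole DataFrame. It is not a column function, so it cannot be referenced by name inside agg().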

【Comments】:

    【Solution 2】:

    Here is code that I tested on Spark 3.1:

    import org.apache.spark.sql.functions.{lit, percentile_approx}
    import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

    val simpleData = Seq(("James","Sales","NY",90000,34,10000),
        ("Michael","Sales","NY",86000,56,20000),
        ("Robert","Sales","CA",81000,30,23000),
        ("Maria","Finance","CA",90000,24,23000),
        ("Raman","Finance","CA",99000,40,24000),
        ("Scott","Finance","NY",83000,36,19000),
        ("Jen","Finance","NY",79000,53,15000),
        ("Jeff","Marketing","CA",80000,25,18000),
        ("Kumar","Marketing","NY",91000,50,21000)
      )
    val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
    df.show()

    // Median salary per department; the third argument is the accuracy parameter
    df.groupBy($"department")
      .agg(
        percentile_approx($"salary", lit(0.5), lit(10000))
      )
      .show(false)
    

    Output:

    +-------------+----------+-----+------+---+-----+
    |employee_name|department|state|salary|age|bonus|
    +-------------+----------+-----+------+---+-----+
    |        James|     Sales|   NY| 90000| 34|10000|
    |      Michael|     Sales|   NY| 86000| 56|20000|
    |       Robert|     Sales|   CA| 81000| 30|23000|
    |        Maria|   Finance|   CA| 90000| 24|23000|
    |        Raman|   Finance|   CA| 99000| 40|24000|
    |        Scott|   Finance|   NY| 83000| 36|19000|
    |          Jen|   Finance|   NY| 79000| 53|15000|
    |         Jeff| Marketing|   CA| 80000| 25|18000|
    |        Kumar| Marketing|   NY| 91000| 50|21000|
    +-------------+----------+-----+------+---+-----+
    
    +----------+-------------------------------------+
    |department|percentile_approx(salary, 0.5, 10000)|
    +----------+-------------------------------------+
    |Sales     |86000                                |
    |Finance   |83000                                |
    |Marketing |80000                                |
    +----------+-------------------------------------+
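
    To get all three quantiles from the question (0.25, 0.5, 0.75) in one pass, a minimal sketch is to pass an array of percentages to percentile_approx and then split the resulting array column. This assumes Spark 3.1 or later, where percentile_approx is available in org.apache.spark.sql.functions; the aliases q, q25, median and q75 and the local SparkSession setup are illustrative choices, and the data is the department/salary subset of the rows above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, lit, percentile_approx}

// Local session for illustration only; in spark-shell a session already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("grouped-quartiles")
  .getOrCreate()
import spark.implicits._

// Department and salary columns taken from the sample data above.
val salaries = Seq(
  ("Sales", 90000), ("Sales", 86000), ("Sales", 81000),
  ("Finance", 90000), ("Finance", 99000), ("Finance", 83000), ("Finance", 79000),
  ("Marketing", 80000), ("Marketing", 91000)
).toDF("department", "salary")

val quartiles = salaries
  .groupBy($"department")
  .agg(
    // An array of percentages returns an array column, one element per percentage.
    percentile_approx($"salary", array(lit(0.25), lit(0.5), lit(0.75)), lit(10000)).as("q")
  )
  // Split the array into one column per quantile.
  .select(
    $"department",
    $"q".getItem(0).as("q25"),
    $"q".getItem(1).as("median"),
    $"q".getItem(2).as("q75")
  )

quartiles.show(false)
```

    Keeping the three percentages in a single percentile_approx call means one aggregate expression per column instead of three separate ones.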
    

    【Comments】:
