【问题标题】:Comparing Previous data with Current data in spark scala将以前的数据与 spark scala 中的当前数据进行比较
【发布时间】:2019-02-15 14:49:53
【问题描述】:

我想按月比较 Prev.data 和 Current 数据。我有如下数据。

Data-set 1 : (Prev)                             Data-set 2 : (Latest)

   Year-month  Sum-count                 Year-Month    Sum-count
      --          --                       201808          48     
     201807       30                       201807          22   
     201806       20                       201806          20
     201805       35                       201805          20
     201804       12                       201804           9
     201803       15                       --              --

我有如上所示的数据集。我想根据年月列和总和来比较这两个数据集,并且需要找出百分比的差异。

我正在使用 spark 2.3.0 和 Scala 2.11。

这里是模式:

import org.apache.spark.sql.functions.lag

val mdf = spark.read.format("csv").
          option("InferSchema","true").
          option("header","true").
          option("delimiter",",").
          option("charset","utf-8").
          load("c:\\test.csv")
mdf.createOrReplaceTempView("test")
val res= spark.sql("select year-month,SUM(Sum-count) as SUM_AMT from test d group by year-month")
val win = org.apache.spark.sql.expressions.Window.orderBy("data_ym")

val res1 = res.withColumn("Prev_month", lag("SUM_AMT", 1,0).over(win)).withColumn("percentage",col("Prev_month") / sum("SUM_AMT").over()).show()

我需要这样的输出:

如果百分比超过 10%,那么我需要将标志设置为 F。

set1     cnt             set2    cnt     output(Percentage)  Flag
201807   30             201807   22         7%                T
201806   20             201806   20         0%                T
201805   35             201805   20         57%               F

请帮帮我。

【问题讨论】:

  • 你能添加一个想要输出的例子吗?谢谢
  • 您好,我已经更新了我的查询。非常感谢。
  • 你是如何计算百分比的?输出中指定的值是否正确?

标签: scala apache-spark-sql


【解决方案1】:

可以这样做:

val data1 = List(
  ("201807", 30),
  ("201806", 20),
  ("201805", 35),
  ("201804", 12),
  ("201803", 15)
)
val data2 = List(
  ("201808", 48),
  ("201807", 22),
  ("201806", 20),
  ("201805", 20),
  ("201804", 9)
)
val df1 = data1.toDF("Year-month", "Sum-count")
val df2 = data2.toDF("Year-month", "Sum-count")

val joined = df1.alias("df1").join(df2.alias("df2"), "Year-month")
joined
  .withColumn("output(Percentage)", abs($"df1.Sum-count" - $"df2.Sum-count").divide($"df1.Sum-count"))
  .withColumn("Flag", when($"output(Percentage)" > 0.1, "F").otherwise("T"))
  .show(false)

输出:

+----------+---------+---------+-------------------+----+
|Year-month|Sum-count|Sum-count|output(Percentage) |Flag|
+----------+---------+---------+-------------------+----+
|201807    |30       |22       |0.26666666666666666|F   |
|201806    |20       |20       |0.0                |T   |
|201805    |35       |20       |0.42857142857142855|F   |
|201804    |12       |9        |0.25               |F   |
+----------+---------+---------+-------------------+----+

【讨论】:

  • 您好,感谢您的帮助。如果是 10 %(加号)或(减号),那么如何考虑这些值?如果我们放 abs() 那么它只会给出正值。感谢您对这个案例的帮助。
【解决方案2】:

这是我的解决方案:

val values1 = List(List("1201807", "30")
  ,List("1201806", "20") ,
  List("1201805", "35"),
  List("1201804","12"),
  List("1201803","15")
).map(x =>(x(0), x(1)))

val values2 = List(List("201808", "48")
  ,List("1201807", "22") ,
  List("1201806", "20"),
  List("1201805","20"),
  List("1201804","9")
).map(x =>(x(0), x(1)))
import spark.implicits._
import org.apache.spark.sql.functions
val df1 = values1.toDF
val df2 = values2.toDF
df1.join(df2, Seq("_1"), "full").toDF("set", "cnt1", "cnt2")
  .withColumn("percentage1", col("cnt1")/sum("cnt1").over() * 100)
  .withColumn("percentage2", col("cnt2")/sum("cnt2").over() * 100)
  .withColumn("percentage", abs(col("percentage2") - col("percentage1")))
  .withColumn("flag", when(col("percentage") > 10, "F").otherwise("T")).na.drop().show()

结果如下:

+-------+----+----+------------------+------------------+------------------+----+
|    set|cnt1|cnt2|       percentage1|       percentage2|        percentage|flag|
+-------+----+----+------------------+------------------+------------------+----+
|1201804|  12|   9|10.714285714285714| 7.563025210084033|  3.15126050420168|   T|
|1201807|  30|  22|26.785714285714285|18.487394957983195|  8.29831932773109|   T|
|1201806|  20|  20|17.857142857142858| 16.80672268907563|1.0504201680672267|   T|
|1201805|  35|  20|             31.25| 16.80672268907563|14.443277310924369|   F|
+-------+----+----+------------------+------------------+------------------+----+

希望对你有帮助:)

【讨论】:

  • 不客气,您认为解决方案很有帮助,能否请您投票/标记解决方案?谢谢,
  • 当然。我已经做到了。最后一点如何设置百分比低于 10 的正负值。非常感谢。
猜你喜欢
  • 1970-01-01
  • 2011-07-18
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 2013-01-16
  • 1970-01-01
相关资源
最近更新 更多