【Question title】: How to check differences in rows belonging to two dataframes
【Posted】: 2016-04-09 08:40:46
【Question description】:

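I have two dataframes that represent the same individuals at two different points in time. For each row, I would like to know whether anything changed in the five (fixed) var columns between the two dataframes.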

Before:

+--+------+------+------+------+------+------+
|id| sport|  var1|  var2|  var3|  var4|  var5|
+--+------+------+------+------+------+------+
| 1|soccer|330234|      |      |      |      |
| 2|soccer|  null|  null|  null|  null|  null|
| 3|soccer|330101|      |      |      |      |
| 4|soccer|  null|  null|  null|  null|  null|
| 5|soccer|  null|  null|  null|  null|  null|
| 6|soccer|  null|  null|  null|  null|  null|
| 7|soccer|  null|  null|  null|  null|  null|
| 8|soccer|330024|330401|      |      |      |
| 9|soccer|330055|330106|      |      |      |
|10|soccer|  null|  null|  null|  null|  null|
|11|soccer|390027|      |      |      |      |
|12|soccer|  null|  null|  null|  null|  null|
|13|soccer|330101|      |      |      |      |
|14|soccer|330059|      |      |      |      |
|15|soccer|  null|  null|  null|  null|  null|
|16|soccer|140242|140281|      |      |      |
|17|soccer|330214|      |      |      |      |
|18|soccer|      |      |      |      |      |
|19|soccer|330055|330196|      |      |      |
|20|soccer|210022|      |      |      |      |
+--+------+------+------+------+------+------+

After:

+--+------+------+------+------+------+------+
|id| sport|  var1|  var2|  var3|  var4|  var5|
+--+------+------+------+------+------+------+
| 1|soccer|330234|      |      |      |      |
| 2|soccer|  null|  null|  null|  null|  null|
| 3|soccer|330101|      |      |      |      |
| 4|soccer|  null|  null|  null|  null|  null|
| 5|soccer|  null|  null|  null|  null|  null|
| 6|soccer|  null|  null|  null|  null|  null|
| 7|soccer|  null|  null|  null|  null|  null|
| 8|soccer|  null|  null|  null|  null|  null|
| 9|soccer|330106|      |      |      |      |
|10|soccer|  null|  null|  null|  null|  null|
|11|soccer|390027|      |      |      |      |
|12|soccer|  null|  null|  null|  null|  null|
|13|soccer|  null|  null|  null|  null|  null|
|14|soccer|330128|330331|330106|330059|      |
|15|soccer|  null|  null|  null|  null|  null|
|16|soccer|140242|140281|140010|      |      |
|17|soccer|330214|      |      |      |      |
|18|soccer|  null|  null|  null|  null|  null|
|19|soccer|330196|      |      |      |      |
|20|soccer|210022|      |      |      |      |
+--+------+------+------+------+------+------+

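I know how to scan for differences between the columns of a single row, but I have no idea how to compare rows across two different dataframes.

The ideal output is: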

+--+------+------+
|id| sport|  diff|
+--+------+------+
| 1|soccer|     0|
| 2|soccer|     0|
| 3|soccer|     0|
| 4|soccer|     0|
| 5|soccer|     0|
| 6|soccer|     0|
| 7|soccer|     0|
| 8|soccer|     1|
| 9|soccer|     1|
|10|soccer|     0|
|11|soccer|     0|
|12|soccer|     0|
|13|soccer|     1|
|14|soccer|     1|
|15|soccer|     0|
|16|soccer|     1|
|17|soccer|     0|
|18|soccer|     0|
|19|soccer|     1|
|20|soccer|     0|
+--+------+------+

【Question comments】:

    Tags: scala apache-spark dataframe apache-spark-sql


    【Solution 1】:

    Do you mean something like this? Let's start with some example data:

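    // Assumes a spark-shell session where `spark` is in scope (Spark 2.x+);
    // on Spark 1.x use `import sqlContext.implicits._` instead
    import org.apache.spark.sql.functions.{col, lit, not}
    import spark.implicits._

    val before = Seq(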
      (1, "soccer", Some(1), Some(2), Some(3), Some(4), None),
      (2, "soccer", None,    Some(0), None,    None,    Some(0)),
      (3, "soccer", None,    None,    None,    None,    None)
    ).toDF("id", "sport", "var1", "var2", "var3", "var4", "var5")
    
    val after = Seq(
      (1, "soccer", Some(1), Some(2), Some(3), Some(4), None), // Zero diffs
      (2, "soccer", Some(1), Some(0), None,    None,    Some(0)), // One diff
      (3, "soccer", Some(1), Some(1), Some(1), Some(1), Some(1)) // Five diffs
    ).toDF("id", "sport", "var1", "var2", "var3", "var4", "var5")
    

    Build the expression that counts the differences:

    // Extract var columns
    val varCols = before.columns.drop(2)
    
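    // Build one expression per var column:
    //   CAST(NOT(before.varN <=> after.varN) AS INT)
    // <=> is null-safe equality, so each expression is 1 if the column changed, else 0
    val notEqualExprs = varCols.map(
      c => not(col(s"before.$c") <=> col(s"after.$c")).cast("int").alias(s"${c}_ne"))
    
    // Sum the per-column flags into a single count
    val diff = notEqualExprs.foldLeft(lit(0))(_ + _).alias("diff")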
    

    This handles:

    • two NULLs: treated as equal
    • any value and NULL: treated as not equal
    • two non-NULL values: standard equality for the column type
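
The three cases above can be checked directly by comparing plain equality (`===`) with null-safe equality (`<=>`) on a few hand-made pairs. This is a minimal sketch, assuming Spark 2.x or later with a local `SparkSession`; the `pairs` data and the column names `a`, `b`, `plain_eq`, `null_safe_eq` and `changed` are made up for illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.not

// Assumption: Spark 2.x+; inside spark-shell this returns the existing session
val spark = SparkSession.builder().master("local[1]").appName("null-safe-eq").getOrCreate()
import spark.implicits._

// Hypothetical pairs covering the cases listed above
val pairs = Seq[(Option[Int], Option[Int])](
  (None, None),        // two NULLs
  (Some(1), None),     // a value and NULL
  (Some(1), Some(1)),  // two equal non-NULL values
  (Some(1), Some(2))   // two different non-NULL values
).toDF("a", "b")

val compared = pairs.select(
  ($"a" === $"b").alias("plain_eq"),               // NULL whenever either side is NULL
  ($"a" <=> $"b").alias("null_safe_eq"),           // always true or false
  not($"a" <=> $"b").cast("int").alias("changed")  // 0 or 1, never NULL
)

compared.show()
```

With `===` in place of `<=>`, the per-column flag would be NULL for any pair involving a NULL, and adding a NULL to the running sum would make the whole `diff` NULL for that row, which is why the null-safe operator is used here.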

    Join and select the expression:

    val diffs = before.as("before").join(after.as("after"), Seq("id", "sport"))
      .select($"id", $"sport", diff)
    
    diffs.show
    
    // +---+------+----+ 
    // | id| sport|diff|
    // +---+------+----+
    // |  1|soccer|   0|
    // |  2|soccer|   1|
    // |  3|soccer|   5|
    // +---+------+----+
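
The tables in the question differ from the sample data in two ways: cells are sometimes blank rather than null (id 18 is blank before and null after, yet the ideal output marks it as unchanged), and the desired `diff` is a 0/1 flag rather than a count. The following is one possible adaptation, not part of the original answer. It assumes the var columns are strings and that a blank string means "missing"; `blankToNull` is a hypothetical helper, and only two var columns and three of the question's ids are used to keep the sketch short.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, not, trim, when}

// Assumption: Spark 2.x+; inside spark-shell this returns the existing session
val spark = SparkSession.builder().master("local[1]").appName("diff-flag").getOrCreate()
import spark.implicits._

type Rec = (Int, String, Option[String], Option[String])

// Ids 8, 17 and 18 from the question, reduced to two var columns for brevity
val before = Seq[Rec](
  (8,  "soccer", Some("330024"), Some("330401")),
  (17, "soccer", Some("330214"), Some("")),
  (18, "soccer", Some(""),       Some(""))
).toDF("id", "sport", "var1", "var2")

val after = Seq[Rec](
  (8,  "soccer", None,           None),
  (17, "soccer", Some("330214"), Some("")),
  (18, "soccer", None,           None)
).toDF("id", "sport", "var1", "var2")

// Hypothetical helper: keep non-blank values, map blank strings (and NULL) to NULL.
// `when` without `otherwise` yields NULL when the condition is false or NULL.
def blankToNull(c: Column): Column = when(trim(c) =!= "", c)

val varCols = before.columns.drop(2)

// 1 if at least one var column changed, else 0
val changed = varCols
  .map(c => not(blankToNull(col(s"before.$c")) <=> blankToNull(col(s"after.$c"))))
  .reduce(_ || _)
  .cast("int")
  .alias("diff")

val flags = before.as("before").join(after.as("after"), Seq("id", "sport"))
  .select($"id", $"sport", changed)
  .orderBy("id")

flags.show()
```

For these three rows the flag is 1 for id 8 and 0 for ids 17 and 18, matching the ideal output in the question. If blank and null should count as different values, drop `blankToNull` and compare the columns directly.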
    

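    【Comments】:

    • I was wondering whether it is possible to write an expression that not only counts the differences but also tells whether they are additions to or removals from the current state. Say that before I have Some(1), Some(2), None, None, None, and compare an after like Some(1), Some(2), Some(3), Some(4), None with an after like None, None, None, None, None. Both are two changes, but the first case is a +2 whereas the second is a -2.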
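
One way to approach the comment above is to count additions (NULL before, non-NULL after) and removals (non-NULL before, NULL after) separately and subtract them. This is a sketch under stated assumptions rather than an answer from the original thread: it assumes Spark 2.x or later, reuses the comment's own example values for ids 1 and 2, and the names `added`, `removed` and `net` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: Spark 2.x+; inside spark-shell this returns the existing session
val spark = SparkSession.builder().master("local[1]").appName("signed-diff").getOrCreate()
import spark.implicits._

type Rec = (Int, String, Option[Int], Option[Int], Option[Int], Option[Int], Option[Int])
val names = Seq("id", "sport", "var1", "var2", "var3", "var4", "var5")

// Both ids start from the state given in the comment
val before = Seq[Rec](
  (1, "soccer", Some(1), Some(2), None, None, None),
  (2, "soccer", Some(1), Some(2), None, None, None)
).toDF(names: _*)

val after = Seq[Rec](
  (1, "soccer", Some(1), Some(2), Some(3), Some(4), None), // two values added
  (2, "soccer", None,    None,    None,    None,    None)  // two values removed
).toDF(names: _*)

val varCols = before.columns.drop(2)

// Number of columns that were NULL before and hold a value after
val added = varCols
  .map(c => (col(s"before.$c").isNull && col(s"after.$c").isNotNull).cast("int"))
  .reduce(_ + _)

// Number of columns that held a value before and are NULL after
val removed = varCols
  .map(c => (col(s"before.$c").isNotNull && col(s"after.$c").isNull).cast("int"))
  .reduce(_ + _)

val signed = before.as("before").join(after.as("after"), Seq("id", "sport"))
  .select($"id", added.alias("added"), removed.alias("removed"), (added - removed).alias("net"))
  .orderBy("id")

signed.show()
```

Here id 1 gets `net` = 2 and id 2 gets `net` = -2. A value that is replaced by a different non-NULL value counts as neither an addition nor a removal in this sketch; the null-safe `diff` expression from the answer still picks it up as a change.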