【Question Title】: Compare the values in different rows of a DataFrame and create a new DataFrame from the rows that satisfy the conditions
【Posted】: 2020-02-07 22:02:53
【Question】:

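I need to apply some logic across different rows of a DataFrame and create a new DataFrame that contains only the rows satisfying that logic.

The input DataFrame is shown below.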

+------------+-------------+-----+-----+-----+-----+
| NUM_ID     | E           |SG1_V|SG2_V|SG3_V|SG4_V|
+------------+-------------+-----+-----+-----+-----+
|XXXXX01     |1570167499000|     |     | 89.0|     |
|XXXXX01     |1570167502000|     |88.0 |     |     |
|XXXXX01     |1570167503000|     |99.0 |     |     |
|XXXXX01     |1570179810000|81.0 |81.0 |81.0 |81.0 |
|XXXXX01     |1570179811000|92.0 |     |95.0 |     |
|XXXXX01     |1570179833000|     |     |88.0 |     |
|XXXXX02     |1570179840000|     |81.0 |     |81.0 |
|XXXXX02     |1570179841000|81.0 |     |81.0 |81.0 |
|XXXXX02     |1570179841000|     |     |     |     |
|XXXXX02     |1570179842000|81.0 |     |     |     |
|XXXXX02     |1570179843000|87.0 |98.0 |97.0 |88.0 |
|XXXXX02     |1570179849000|     |     |     |     |
|XXXXX03     |1570179850000|     |     |     |     |
|XXXXX03     |1570179852000|88.0 |     |     |     |
|XXXXX03     |1570179857000|     |     |     |88.0 |
|XXXXX03     |1570179858000|     |     |     |88.0 |
+------------+-------------+-----+-----+-----+-----+

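For each NUM_ID, I have to check the values in every SG_V column and find where the difference between successive values of the same SG_V column is more than 10. If one SG_V column, or several SG_V columns in the same row, show a difference of more than 10, that row counts as a single output row.

It should become clear once you look at the expected output, shown below.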

+------------+-------------+------------+-----+------------+-----+------------+-----+------------+-----+
| NUM_ID     | E           |PREVIOUS_SG1|SG1_V|PREVIOUS_SG2|SG2_V|PREVIOUS_SG3|SG3_V|PREVIOUS_SG4|SG4_V|
+------------+-------------+------------+-----+------------+-----+------------+-----+------------+-----+
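|XXXXX01     |1570167503000|            |     |88.0        |99.0 |            |     |            |     |
|XXXXX01     |1570179811000|81.0        |92.0 |            |     |81.0        |95.0 |            |     |
|XXXXX02     |1570179843000|            |     |81.0        |98.0 |81.0        |97.0 |            |     |
+------------+-------------+------------+-----+------------+-----+------------+-----+------------+-----+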
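
Reading the rule off the tables above, it appears that "previous" means the most recent non-empty value of the same SG column within the same NUM_ID (ordered by E), and that a row qualifies only when the current value minus that previous value is greater than 10 (a drop such as 99.0 to 81.0 does not appear in the expected output). That interpretation is an inference, not something the post states. A minimal plain-Scala sketch of the check for a single column, using a hypothetical helper named `jumps`:

```scala
object JumpSketch {
  // Hypothetical helper, not from the original post.
  // `values` is one SG column of one NUM_ID in E order; None models an empty cell.
  // Returns the (previous, current) pairs where current - previous > threshold.
  def jumps(values: Seq[Option[Double]], threshold: Double = 10.0): Seq[(Double, Double)] = {
    val present = values.flatten          // skip empty cells, keep the order
    present.zip(present.drop(1))          // pair each value with the next non-empty one
      .filter { case (prev, cur) => cur - prev > threshold }
  }

  def main(args: Array[String]): Unit = {
    // SG2_V for NUM_ID XXXXX01 in E order: empty, 88.0, 99.0, 81.0, empty, empty
    val sg2 = Seq(None, Some(88.0), Some(99.0), Some(81.0), None, None)
    println(jumps(sg2)) // prints List((88.0,99.0))
  }
}
```

The single pair (88.0, 99.0) corresponds to the first row of the expected output, where PREVIOUS_SG2 is 88.0 and SG2_V is 99.0.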

Thanks in advance! Any leads are appreciated.

    Tags: scala dataframe apache-spark apache-spark-sql apache-spark-dataset


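    【Solution 1】:

    Maybe something like this:

    I compute the differences between adjacent columns, check whether each one is > 10, collect the results into a boolean array, and finally use array_contains to check whether that array contains a false value.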

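      import spark.implicits._
      import org.apache.spark.sql.functions._

      // Sample data: four numeric columns (_1 to _4) plus a generated row id
      val df = Seq(
        (10, 21, 32, 43),
        (10, 20, 30, 40),
        (1, 2, 3, 4),
        (1, 100, 200, 300)
      ).toDF().withColumn("id", monotonically_increasing_id())

      df.show()

      // Pair each value column with its right-hand neighbour: (_1,_2), (_2,_3), (_3,_4)
      val cols = df.columns.dropRight(1) // drop the trailing "id" column
      val pairs: Array[(String, String)] = cols.zip(cols.tail)

      println("pairs:")
      pairs.foreach(print(_))

      // True when at least one adjacent difference is NOT greater than 10
      val calcDiff = array_contains(
        array(
          pairs.map(s => (df(s._2) - df(s._1)) > 10): _*
        ), false
      )

      // Keep only rows where every adjacent difference is > 10.
      // The negation is needed to get the output shown below;
      // filter(calcDiff) would keep the other two rows instead.
      df.filter(!calcDiff).show()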
    

    Output:

    +---+---+---+---+---+
    | _1| _2| _3| _4| id|
    +---+---+---+---+---+
    | 10| 21| 32| 43|  0|
    | 10| 20| 30| 40|  1|
    |  1|  2|  3|  4|  2|
    |  1|100|200|300|  3|
    +---+---+---+---+---+
    
    pairs:
    (_1,_2)(_2,_3)(_3,_4)
    
    +---+---+---+---+---+
    | _1| _2| _3| _4| id|
    +---+---+---+---+---+
    | 10| 21| 32| 43|  0|
    |  1|100|200|300|  3|
    +---+---+---+---+---+
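
Note that the snippet above compares adjacent columns within one row, whereas the question compares successive rows of the same NUM_ID. A rough sketch of the row-wise variant with window functions follows. It assumes the empty cells are nulls in double columns, that "previous" means the last non-null value in earlier rows ordered by E, and that only increases of more than 10 count. The names `Reading`, `RowDiffSketch`, and `rowsWithJumps` are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical row type mirroring the question's input columns
case class Reading(NUM_ID: String, E: Long,
                   SG1_V: Option[Double], SG2_V: Option[Double],
                   SG3_V: Option[Double], SG4_V: Option[Double])

object RowDiffSketch {
  val sgCols = Seq("SG1_V", "SG2_V", "SG3_V", "SG4_V")
  private def prevName(c: String) = "PREVIOUS_" + c.stripSuffix("_V")

  def rowsWithJumps(df: DataFrame, threshold: Double = 10.0): DataFrame = {
    // Frame: all earlier rows of the same NUM_ID, ordered by E
    val earlier = Window.partitionBy("NUM_ID").orderBy("E")
      .rowsBetween(Window.unboundedPreceding, -1)

    // PREVIOUS_SGn = most recent non-null SGn_V among the earlier rows
    val withPrev = sgCols.foldLeft(df) { (acc, c) =>
      acc.withColumn(prevName(c), last(col(c), ignoreNulls = true).over(earlier))
    }

    // Blank out every (previous, current) pair whose increase is not > threshold;
    // a null on either side makes the condition null, which `when` treats as false
    val masked = sgCols.flatMap { c =>
      val jump = col(c) - col(prevName(c)) > threshold
      Seq(when(jump, col(prevName(c))).as(prevName(c)), when(jump, col(c)).as(c))
    }

    withPrev
      .select((col("NUM_ID") +: col("E") +: masked): _*)
      // keep a row if at least one SG column survived the masking
      .filter(sgCols.map(c => col(c).isNotNull).reduce(_ || _))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("row-diff-sketch").getOrCreate()
    import spark.implicits._
    // A few rows of NUM_ID XXXXX01 taken from the question's input table
    val input = Seq(
      Reading("XXXXX01", 1570167502000L, None, Some(88.0), None, None),
      Reading("XXXXX01", 1570167503000L, None, Some(99.0), None, None),
      Reading("XXXXX01", 1570179810000L, Some(81.0), Some(81.0), Some(81.0), Some(81.0)),
      Reading("XXXXX01", 1570179811000L, Some(92.0), None, Some(95.0), None)
    ).toDF()
    rowsWithJumps(input).orderBy("NUM_ID", "E").show()
    spark.stop()
  }
}
```

On these four sample rows the sketch should keep the rows with E = 1570167503000 and E = 1570179811000, in line with the expected output in the question. When two rows share the same NUM_ID and E, as happens once in the sample data, their relative order inside the window is not defined, so a tie-breaking column would be needed if both rows carried values.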
    
