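[Posted]: 2020-02-07 22:02:53
[Question]:
I need to apply some logic across the rows of a DataFrame and build a new DataFrame that contains only the rows satisfying that logic.
The input DataFrame is shown below.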
+------------+-------------+-----+-----+-----+-----+
|NUM_ID      |E            |SG1_V|SG2_V|SG3_V|SG4_V|
+------------+-------------+-----+-----+-----+-----+
|XXXXX01     |1570167499000|     |     |89.0 |     |
|XXXXX01     |1570167502000|     |88.0 |     |     |
|XXXXX01     |1570167503000|     |99.0 |     |     |
|XXXXX01     |1570179810000|81.0 |81.0 |81.0 |81.0 |
|XXXXX01     |1570179811000|92.0 |     |95.0 |     |
|XXXXX01     |1570179833000|     |     |88.0 |     |
|XXXXX02     |1570179840000|     |81.0 |     |81.0 |
|XXXXX02     |1570179841000|81.0 |     |81.0 |81.0 |
|XXXXX02     |1570179841000|     |     |     |     |
|XXXXX02     |1570179842000|81.0 |     |     |     |
|XXXXX02     |1570179843000|87.0 |98.0 |97.0 |88.0 |
|XXXXX02     |1570179849000|     |     |     |     |
|XXXXX03     |1570179850000|     |     |     |     |
|XXXXX03     |1570179852000|88.0 |     |     |     |
|XXXXX03     |1570179857000|     |     |     |88.0 |
|XXXXX03     |1570179858000|     |     |     |88.0 |
+------------+-------------+-----+-----+-----+-----+
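For each NUM_ID, I have to check the values in every SG_V column and find the rows where the difference between successive values of a column is greater than 10. Whether that difference shows up in a single SG_V column or in several SG_V columns of the same row, it should be reported as a single row.
It should become clear once you look at the expected output, which is as follows.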
+------------+-------------+------------+-----+------------+-----+------------+-----+------------+-----+
|NUM_ID      |E            |PREVIOUS_SG1|SG1_V|PREVIOUS_SG2|SG2_V|PREVIOUS_SG3|SG3_V|PREVIOUS_SG4|SG4_V|
+------------+-------------+------------+-----+------------+-----+------------+-----+------------+-----+
|XXXXX01     |1570167503000|            |     |88.0        |99.0 |            |     |            |     |
|XXXXX01     |1570179811000|81.0        |92.0 |            |     |81.0        |95.0 |            |     |
|XXXXX02     |1570179843000|            |     |81.0        |98.0 |81.0        |97.0 |            |     |
+------------+-------------+------------+-----+------------+-----+------------+-----+------------+-----+
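To pin the rule down, here is a rough sketch of one way it might be written with a Spark window function. It is not a confirmed solution, and it rests on several assumptions: the blank cells are nulls in Double columns, rows are ordered by E within each NUM_ID, "previous" means the last non-null value of the same column, and only increases of more than 10 count (the sample output leaves out the SG2_V drop from 99.0 to 81.0, which suggests the difference is signed). The names `SgJumps`, `jumps`, and `sampleInput` are made up for illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last, when}

object SgJumps {
  val sgCols: Seq[String] = Seq("SG1_V", "SG2_V", "SG3_V", "SG4_V")

  // "SG1_V" -> "PREVIOUS_SG1"
  private def prevName(c: String): String = "PREVIOUS_" + c.stripSuffix("_V")

  def jumps(df: DataFrame, threshold: Double = 10.0): DataFrame = {
    // All earlier rows of the same NUM_ID (ordered by E), excluding the current row.
    val earlier = Window.partitionBy("NUM_ID").orderBy("E")
      .rowsBetween(Window.unboundedPreceding, -1)

    // Step 1: attach the last non-null earlier value of every SG column.
    // Window functions cannot be used directly in a filter, so they are materialised first.
    val withPrev = sgCols.foldLeft(df) { (acc, c) =>
      acc.withColumn(prevName(c), last(col(c), ignoreNulls = true).over(earlier))
    }

    // Step 2: per column, true when the value rose by more than the threshold
    // (null when either side is missing).
    def jumped(c: String): Column = (col(c) - col(prevName(c))) > threshold

    // Step 3: keep rows with at least one jump, and blank out the columns that did not jump.
    val outCols: Seq[Column] = Seq(col("NUM_ID"), col("E")) ++ sgCols.flatMap { c =>
      Seq(when(jumped(c), col(prevName(c))).as(prevName(c)), when(jumped(c), col(c)).as(c))
    }
    withPrev.where(sgCols.map(jumped).reduce(_ || _))
      .select(outCols: _*)
      .orderBy("NUM_ID", "E")
  }

  // The input table from the question, with blanks modelled as None (an assumption).
  def sampleInput(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val n: Option[Double] = None
    def v(x: Double): Option[Double] = Some(x)
    Seq(
      ("XXXXX01", 1570167499000L, n, n, v(89.0), n),
      ("XXXXX01", 1570167502000L, n, v(88.0), n, n),
      ("XXXXX01", 1570167503000L, n, v(99.0), n, n),
      ("XXXXX01", 1570179810000L, v(81.0), v(81.0), v(81.0), v(81.0)),
      ("XXXXX01", 1570179811000L, v(92.0), n, v(95.0), n),
      ("XXXXX01", 1570179833000L, n, n, v(88.0), n),
      ("XXXXX02", 1570179840000L, n, v(81.0), n, v(81.0)),
      ("XXXXX02", 1570179841000L, v(81.0), n, v(81.0), v(81.0)),
      ("XXXXX02", 1570179841000L, n, n, n, n),
      ("XXXXX02", 1570179842000L, v(81.0), n, n, n),
      ("XXXXX02", 1570179843000L, v(87.0), v(98.0), v(97.0), v(88.0)),
      ("XXXXX02", 1570179849000L, n, n, n, n),
      ("XXXXX03", 1570179850000L, n, n, n, n),
      ("XXXXX03", 1570179852000L, v(88.0), n, n, n),
      ("XXXXX03", 1570179857000L, n, n, n, v(88.0)),
      ("XXXXX03", 1570179858000L, n, n, n, v(88.0))
    ).toDF("NUM_ID", "E", "SG1_V", "SG2_V", "SG3_V", "SG4_V")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sg-jumps").getOrCreate()
    jumps(sampleInput(spark)).show(truncate = false)
    spark.stop()
  }
}
```

Under those assumptions, this should keep only the three rows listed in the expected output. Two rows for XXXXX02 share the same E, so their relative order inside the window is not deterministic; it makes no difference here because one of them is entirely empty, but real data would probably need a tie-breaking column in the `orderBy`.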
Thanks in advance! Any pointers are appreciated.
[Comments]:
Tags: scala dataframe apache-spark apache-spark-sql apache-spark-dataset