【Posted】: 2022-02-22 03:56:26
【Question】:
I have two DataFrames, and I am trying to write a function that compares them and returns the net change in each affected column.
DF1:
+---------------+------+------+-------+----------+
| City          | Temp | Zone | Score | Activity |
+---------------+------+------+-------+----------+
| Atlanta       | 10   | 1    | 100   | 400      |
+---------------+------+------+-------+----------+
| Chicago       | 100  | 2    | 200   | 500      |
+---------------+------+------+-------+----------+
| Boston        | 100  | 3    | 300   | 600      |
+---------------+------+------+-------+----------+
| San Francisco | 1000 | 4    | 400   | 700      |
+---------------+------+------+-------+----------+
DF2:
+---------------+------+------+-------+----------+
| City          | Temp | Zone | Score | Activity |
+---------------+------+------+-------+----------+
| Atlanta       | 10   | 1    | 150   | 400      |
+---------------+------+------+-------+----------+
| Chicago       | 100  | 2    | 200   | 450      |
+---------------+------+------+-------+----------+
| Boston        | 100  | 3    | 300   | 650      |
+---------------+------+------+-------+----------+
| San Francisco | 1200 | 4    | 400   | 750      |
+---------------+------+------+-------+----------+
I would like the result to look like this:
+---------------+------+------+-------+----------+
| City          | Temp | Zone | Score | Activity |
+---------------+------+------+-------+----------+
| Atlanta       | 0    | 0    | 50    | 0        |
+---------------+------+------+-------+----------+
| Boston        | 0    | 0    | 0     | -50      |
+---------------+------+------+-------+----------+
| San Francisco | 200  | 0    | 0     | 50       |
+---------------+------+------+-------+----------+
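I am new to PySpark and would like to know how to do this in PySpark.
I tried df2.subtract(df1), but it only shows me the rows of df2 that are not in df1, which is not much help when all I want to see is the net change in any column.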
Note: the city name is the unique identifier, so every row has a different city.
Thanks for your help!
【Comments】:
-
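In my real data, though, each DataFrame has many columns, and I want the function to work out which columns changed without my having to name them explicitly.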