【Question title】: How to compare two dataframes and calculate the differences in PySpark?
【Posted】: 2022-02-22 03:56:26
【Question description】:

I have two dataframes, and I am trying to write a function that compares them and returns the net change for each affected column.

DF1:

+---------------+------+------+-------+----------+
| City          | Temp | Zone | Score | Activity |
+---------------+------+------+-------+----------+
| Atlanta       | 10   | 1    | 100   | 400      |
+---------------+------+------+-------+----------+
| Chicago       | 100  | 2    | 200   | 500      |
+---------------+------+------+-------+----------+
| Boston        | 100  | 3    | 300   | 600      |
+---------------+------+------+-------+----------+
| San Francisco | 1000 | 4    | 400   | 700      |
+---------------+------+------+-------+----------+

DF2:

+---------------+------+------+-------+----------+
| City          | Temp | Zone | Score | Activity |
+---------------+------+------+-------+----------+
| Atlanta       | 10   | 1    | 150   | 400      |
+---------------+------+------+-------+----------+
| Chicago       | 100  | 2    | 200   | 450      |
+---------------+------+------+-------+----------+
| Boston        | 100  | 3    | 300   | 650      |
+---------------+------+------+-------+----------+
| San Francisco | 1200 | 4    | 400   | 750      |
+---------------+------+------+-------+----------+

I would like the result to look like this:

+---------------+------+------+-------+----------+
| City          | Temp | Zone | Score | Activity |
+---------------+------+------+-------+----------+
| Atlanta       | 0    | 0    | 50    | 0        |
+---------------+------+------+-------+----------+
| Boston        | 0    | 0    | 0     | -50      |
+---------------+------+------+-------+----------+
| San Francisco | 200  | 0    | 0     | 50       |
+---------------+------+------+-------+----------+

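I am new to PySpark and would like to know how to achieve this.

I tried df2.subtract(df1), but it only shows me the rows in df2 that are not in df1, which is not very helpful when I only want to see the net change that happened in each column.

Note: the city name is the unique identifier. Every row has a different city.

Thanks for your help!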

【Question comments】:

Tags: python dataframe pyspark


【Solution 1】:

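dataframe.subtract(dataframe) is a logical (set) subtraction, equivalent to EXCEPT DISTINCT in SQL. It returns whole rows, not arithmetic differences between column values.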


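So, instead, you can join the two dataframes on City and do an arithmetic subtraction between the matching columns.
from pyspark.sql import functions as F

# Alias each side so same-named columns stay unambiguous after the join.
joined = df1.alias('a').join(df2.alias('b'), on='City')

# For every non-key column, compute df2 - df1 as diff_<column>.
df = joined.select('City', *[(F.col('b.' + c) - F.col('a.' + c)).alias('diff_' + c)
                             for c in df1.columns if c != 'City'])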

【Discussion】:
