【发布时间】:2021-11-27 12:00:21
【问题描述】:
我正在尝试计算 PySpark Dataframe 列中列表(成本)的平均值,小于平均值的值得到 1,高于平均值的值为 0。
这是当前的数据框:
+----------+--------------------+--------------------+
| id| collect_list(p_id)|collect_list(cost) |
+----------+--------------------+--------------------+
| 7|[10, 987, 872] |[12.0, 124.6, 197.0]|
| 6|[11, 858, 299] |[15.0, 167.16, 50.0]|
| 17| [2]| [65.4785]|
| 1|[34359738369, 343...|[16.023384, 104.9...|
| 3|[17179869185, 0, ...|[48.3255, 132.025...|
+----------+--------------------+--------------------+
这是所需的输出:
+----------+--------------------+--------------------+-----------+
| id| p_id |cost | result |
+----------+--------------------+--------------------+-----------+
| 7|10 |12.0 | 1 |
| 7|987 |124.6 | 0 |
| 7|872 |197.0 | 0 |
| 6|11 |15.0 | 1 |
| 6|858 |167.16 | 0 |
| 6|299 |50.0 | 1 |
| 17|2 |65.4785 | 1 |
+----------+--------------------+--------------------+-----------+
【问题讨论】:
标签: python pyspark apache-spark-sql