How to add a column with specific values to a PySpark SQL DataFrame? Answers

[Question title]: How to add a column with specific values to a PySpark SQL DataFrame?
[Posted]: 2020-03-28 13:05:43
[Question description]:

I have a table like this:

+--------------------+--------------------+-------------------+
|                  ID|               point|          timestamp|
+--------------------+--------------------+-------------------+
|679ac975acc4bdec9...|POINT (-73.267631...|2020-01-01 17:10:49|
|679ac975acc4bdec9...|POINT (-73.271446...|2020-01-01 02:12:31|
|679ac975acc4bdec9...|POINT (-73.265991...|2020-01-01 17:10:40|
|679ac975acc4bdec9...|POINT (-73.271446...|2020-01-01 02:54:15|
|679ac975acc4bdec9...|POINT (-73.265609...|2020-01-01 17:10:24|
+--------------------+--------------------+-------------------+

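I want to add a column point1 that holds the same values as the point column, but shifted up by one row (each row gets the point of the next row), with the last row set to 0: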

+--------------------+--------------------+-------------------+---------+---------+--------------------+
|                  ID|               point|          timestamp|      lon|      lat|              point1|
+--------------------+--------------------+-------------------+---------+---------+--------------------+
|679ac975acc4bdec9...|POINT (-73.267631...|2020-01-01 17:10:49|-73.26763|40.850548|POINT (-73.271446...|
|679ac975acc4bdec9...|POINT (-73.271446...|2020-01-01 02:12:31|-73.27145| 40.85318|POINT (-73.265991...|
|679ac975acc4bdec9...|POINT (-73.265991...|2020-01-01 17:10:40|-73.26599|40.851482|POINT (-73.271446...|
|679ac975acc4bdec9...|POINT (-73.271446...|2020-01-01 02:54:15|-73.27145|40.853184|POINT (-73.265609...|
|679ac975acc4bdec9...|POINT (-73.265609...|2020-01-01 17:10:24|-73.26561|40.854164|                   0|
+--------------------+--------------------+-------------------+---------+---------+--------------------+

[Question discussion]:

Tags: python sql pyspark pyspark-sql

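[Solution 1]:

Use the window function lead to generate the point1 values from the point column.

If lead returns null (the row has no following row in its partition), replace it with 0.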

df.show()
#+-----------------+-----------------+-------------------+---------+---------+
#|               ID|            point|          timestamp|      lon|      lat|
#+-----------------+-----------------+-------------------+---------+---------+
#|679ac975acc4bdec9|POINT (-73.267631|2020-01-01 17:10:49|-73.26763|40.850548|
#|679ac975acc4bdec9|POINT (-73.271446|2020-01-01 02:12:31|-73.27145| 40.85318|
#|679ac975acc4bdec9|POINT (-73.265991|2020-01-01 17:10:40|-73.26599|40.851482|
#+-----------------+-----------------+-------------------+---------+---------+

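from pyspark.sql.window import Window
from pyspark.sql.functions import lead, lit

# Ordering by a constant does not guarantee any particular row order.
# Replace lit("1") with a real column (for example "timestamp") if you need a specific order.
w = Window.partitionBy('ID').orderBy(lit("1"))

# lead("point", 1, "0") returns the next row's point within each ID,
# and "0" for the last row of the partition, where there is no next row.
# The default is written as the string "0" so that it matches the type of the point column.
df.withColumn("point1", lead("point", 1, "0").over(w)).show()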
#+-----------------+-----------------+-------------------+---------+---------+-----------------+
#|               ID|            point|          timestamp|      lon|      lat|           point1|
#+-----------------+-----------------+-------------------+---------+---------+-----------------+
#|679ac975acc4bdec9|POINT (-73.267631|2020-01-01 17:10:49|-73.26763|40.850548|POINT (-73.271446|
#|679ac975acc4bdec9|POINT (-73.271446|2020-01-01 02:12:31|-73.27145| 40.85318|POINT (-73.265991|
#|679ac975acc4bdec9|POINT (-73.265991|2020-01-01 17:10:40|-73.26599|40.851482|                0|
#+-----------------+-----------------+-------------------+---------+---------+-----------------+
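
To make the semantics of lead with a default value concrete, here is a minimal plain-Python sketch of what the window computes. It is only a model, not Spark code: the helper add_lead, the sample IDs "a" and "b", and the point labels P1 to P4 are made up for illustration, and the rows are ordered by timestamp rather than by a constant, which is an assumption about the order you might want.

```python
from itertools import groupby
from operator import itemgetter


def add_lead(rows, partition_key, order_key, value_key, new_key, default="0"):
    """Plain-Python model of lead(value_key, 1, default) over a window
    partitioned by partition_key and ordered by order_key (hypothetical helper)."""
    result = []
    # groupby needs its input sorted by the grouping key; sorting by
    # (partition_key, order_key) also fixes the order inside each partition.
    ordered = sorted(rows, key=itemgetter(partition_key, order_key))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        group = list(group)
        for i, row in enumerate(group):
            # Take the next row's value, or the default on the last row.
            nxt = group[i + 1][value_key] if i + 1 < len(group) else default
            result.append({**row, new_key: nxt})
    return result


# Illustrative data only: IDs and point labels are placeholders.
rows = [
    {"ID": "a", "point": "P1", "timestamp": "2020-01-01 17:10:49"},
    {"ID": "a", "point": "P2", "timestamp": "2020-01-01 02:12:31"},
    {"ID": "a", "point": "P3", "timestamp": "2020-01-01 17:10:40"},
    {"ID": "b", "point": "P4", "timestamp": "2020-01-01 02:54:15"},
]

for row in add_lead(rows, "ID", "timestamp", "point", "point1"):
    print(row["ID"], row["timestamp"], row["point"], "->", row["point1"])
```

Each partition is handled independently, so the last row of every ID receives the default rather than a value from the next ID. In the PySpark version above, the same effect comes from partitionBy('ID') combined with the third argument of lead.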

[Discussion]: