【Question title】: How to add a column with specific values to a PySpark SQL DataFrame?
【Posted】: 2020-03-28 13:05:43
【Question】:

I have a table like this:

+--------------------+--------------------+-------------------+
|                  ID|               point|          timestamp|
+--------------------+--------------------+-------------------+
|679ac975acc4bdec9...|POINT (-73.267631...|2020-01-01 17:10:49|
|679ac975acc4bdec9...|POINT (-73.271446...|2020-01-01 02:12:31|
|679ac975acc4bdec9...|POINT (-73.265991...|2020-01-01 17:10:40|
|679ac975acc4bdec9...|POINT (-73.271446...|2020-01-01 02:54:15|
|679ac975acc4bdec9...|POINT (-73.265609...|2020-01-01 17:10:24|
+--------------------+--------------------+-------------------+

I want to add a column point1 that holds the same values as the point column, but shifted up by one row, with the last value set to 0:

+--------------------+--------------------+-------------------+---------+---------+--------------------+
|                  ID|               point|          timestamp|      lon|      lat|              point1|
+--------------------+--------------------+-------------------+---------+---------+--------------------+
|679ac975acc4bdec9...|POINT (-73.267631...|2020-01-01 17:10:49|-73.26763|40.850548|POINT (-73.271446...|
|679ac975acc4bdec9...|POINT (-73.271446...|2020-01-01 02:12:31|-73.27145| 40.85318|POINT (-73.265991...|
|679ac975acc4bdec9...|POINT (-73.265991...|2020-01-01 17:10:40|-73.26599|40.851482|POINT (-73.271446...|
|679ac975acc4bdec9...|POINT (-73.271446...|2020-01-01 02:54:15|-73.27145|40.853184|POINT (-73.265609...|
|679ac975acc4bdec9...|POINT (-73.265609...|2020-01-01 17:10:24|-73.26561|40.854164|                   0|
+--------------------+--------------------+-------------------+---------+---------+--------------------+

【Comments】:

    Tags: python sql pyspark pyspark-sql


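    【Solution 1】:

    Use the window function lead to generate the point1 values from the point column.

    • Where lead finds no following row (it would return null), replace the value with 0.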
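For intuition, the shift that lead performs can be sketched in plain Python, without Spark. This is only an illustration of the semantics: the helper name `lead_per_partition` and the sample rows are made up for this sketch and are not part of the answer or of any library.

```python
from itertools import groupby
from operator import itemgetter


def lead_per_partition(rows, key, column, default=0):
    """Hypothetical helper: add `<column>1`, the next row's `column` within the same `key` group."""
    result = []
    # groupby needs the rows sorted by key; sorted() is stable, so the
    # original order inside each partition is preserved.
    for _, group in groupby(sorted(rows, key=itemgetter(key)), key=itemgetter(key)):
        group = list(group)
        # Pair every row with its successor; the last row of the group gets the default.
        next_values = [r[column] for r in group[1:]] + [default]
        for row, nxt in zip(group, next_values):
            result.append({**row, column + "1": nxt})
    return result


# Made-up sample rows, not the asker's data.
rows = [
    {"ID": "a", "point": "POINT (1 1)"},
    {"ID": "a", "point": "POINT (2 2)"},
    {"ID": "a", "point": "POINT (3 3)"},
    {"ID": "b", "point": "POINT (9 9)"},
]
shifted = lead_per_partition(rows, key="ID", column="point")
```

Each row receives the point of the row that follows it in the same ID group, and the last row of every group falls back to the default.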

    df.show()
    #+-----------------+-----------------+-------------------+---------+---------+
    #|               ID|            point|          timestamp|      lon|      lat|
    #+-----------------+-----------------+-------------------+---------+---------+
    #|679ac975acc4bdec9|POINT (-73.267631|2020-01-01 17:10:49|-73.26763|40.850548|
    #|679ac975acc4bdec9|POINT (-73.271446|2020-01-01 02:12:31|-73.27145| 40.85318|
    #|679ac975acc4bdec9|POINT (-73.265991|2020-01-01 17:10:40|-73.26599|40.851482|
    #+-----------------+-----------------+-------------------+---------+---------+
    
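    from pyspark.sql.window import Window
    from pyspark.sql.functions import lead, lit

    # Ordering by a constant leaves the row order inside each partition undefined;
    # order by a real column (e.g. timestamp) if you need a specific order.
    w = Window.partitionBy('ID').orderBy(lit("1"))

    # lead(col, offset, default): value of the next row, or the default (0) when there is none
    df.withColumn("point1", lead("point", 1, 0).over(w)).show()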
    #+-----------------+-----------------+-------------------+---------+---------+-----------------+
    #|               ID|            point|          timestamp|      lon|      lat|           point1|
    #+-----------------+-----------------+-------------------+---------+---------+-----------------+
    #|679ac975acc4bdec9|POINT (-73.267631|2020-01-01 17:10:49|-73.26763|40.850548|POINT (-73.271446|
    #|679ac975acc4bdec9|POINT (-73.271446|2020-01-01 02:12:31|-73.27145| 40.85318|POINT (-73.265991|
    #|679ac975acc4bdec9|POINT (-73.265991|2020-01-01 17:10:40|-73.26599|40.851482|                0|
    #+-----------------+-----------------+-------------------+---------+---------+-----------------+
    

    【Comments】:
