【Question title】: PySpark: generate rows depending on a column value
【Posted】: 2019-11-16 06:34:04
【Question】:

Here is the input data:

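+-------------------+-------------------+--------+
|       start       |   format_date     |    diff|
+-------------------+-------------------+--------+
|2019-11-15 20:30:00|2019-11-15 18:30:00|     4  |
+-------------------+-------------------+--------+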

Expected output:

start                format_date          diff  seq
2019-11-15 20:30:00  2019-11-15 18:30:00  4     1
2019-11-15 20:30:00  2019-11-15 18:30:00  4     2
2019-11-15 20:30:00  2019-11-15 18:30:00  4     3
2019-11-15 20:30:00  2019-11-15 18:30:00  4     4

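How can I generate rows based on the value of a column (diff), so that each input row is repeated diff times with a running seq number?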

【Comments】:

Tags: apache-spark pyspark


【Solution 1】:

A solution for Spark 2.4 or later, where the built-in sequence function is available:

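from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Reuse the active session (as in the pyspark shell) or create one
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [["2019-11-15 20:30:00", "2019-11-15 18:30:00", 4]],
    ["start", "format_date", "diff"],
)

# sequence(1, diff) builds the array [1, ..., diff]; explode emits one row per element
df.select("*", F.explode(F.sequence(F.lit(1), F.col("diff"))).alias("seq")).show()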


+-------------------+-------------------+----+---+
|              start|        format_date|diff|seq|
+-------------------+-------------------+----+---+
|2019-11-15 20:30:00|2019-11-15 18:30:00|   4|  1|
|2019-11-15 20:30:00|2019-11-15 18:30:00|   4|  2|
|2019-11-15 20:30:00|2019-11-15 18:30:00|   4|  3|
|2019-11-15 20:30:00|2019-11-15 18:30:00|   4|  4|
+-------------------+-------------------+----+---+
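
To make the row-generation logic easier to reason about, here is a small plain-Python model of what explode(sequence(1, diff)) does per row. It is only a sketch for illustration, not Spark code: the helper names sequence and explode_rows are hypothetical, and the behaviour it mimics (a default step of -1 when start is greater than stop, and explode dropping rows whose array is NULL or empty) reflects my reading of the Spark documentation.

```python
def sequence(start, stop):
    """Plain-Python model of Spark's sequence(start, stop) with the default step."""
    if start is None or stop is None:
        return None  # modelled after Spark returning NULL for a NULL argument
    step = 1 if start <= stop else -1  # default step direction
    return list(range(start, stop + step, step))


def explode_rows(rows):
    """Model of select("*", explode(sequence(1, diff))): one output row per element."""
    out = []
    for start, format_date, diff in rows:
        seq = sequence(1, diff)
        if not seq:  # explode drops rows whose array is NULL or empty
            continue
        out.extend((start, format_date, diff, s) for s in seq)
    return out


rows = [("2019-11-15 20:30:00", "2019-11-15 18:30:00", 4)]
result = explode_rows(rows)
for row in result:
    print(row)
```

One consequence worth checking on real data: under this model a diff of 0 yields the descending sequence [1, 0], that is two rows rather than none, so it may be safer to filter to diff >= 1 before exploding.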

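【Comments】:

    【Solution 2】:

    In Spark you can also use the explode function, together with a UDF that builds the array of sequence numbers: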

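    import pyspark.sql.functions as F
    import pyspark.sql.types as Types

    def rangeArr(diff):
        # Return a list, not a range object, so the result matches ArrayType under Python 3
        return list(range(1, diff + 1))

    rangeUdf = F.udf(rangeArr, Types.ArrayType(Types.IntegerType()))

    # Build the array [1, ..., diff] for each row, then emit one row per element
    df = df.withColumn('seqArr', rangeUdf('diff'))
    df = df.withColumn('seq', F.explode('seqArr'))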
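
The function behind the UDF can be tested without a Spark session. The sketch below is a slightly more defensive variant, assuming that a NULL or non-positive diff should produce no rows; the name range_arr and that policy are my assumptions, not part of the original answer.

```python
def range_arr(diff):
    """Return [1, 2, ..., diff] as a list; empty for None or diff < 1."""
    if diff is None or diff < 1:
        return []
    return list(range(1, diff + 1))


# Registration would mirror the answer's own line (not executed here):
# rangeUdf = F.udf(range_arr, Types.ArrayType(Types.IntegerType()))

print(range_arr(4))
print(range_arr(0))
```

Because explode drops rows whose array is empty, rows with such diff values would disappear from the result; explode_outer could be used instead if they should be kept with a NULL seq.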
    

    【Comments】:
