【Question title】: How to split one column and keep other columns in a PySpark DataFrame?
【Posted】: 2021-03-01 18:40:27
【Question description】:

I have data like this:

>>> data = sc.parallelize([[1,5,10,0,[1,2,3,4,5,6]],[0,10,20,1,[2,3,4,5,6,7]],[1,15,25,0,[3,4,5,6,7,8]],[0,30,40,1,[4,5,6,7,8,9]]]).toDF(('a','b','c',"d","e"))
>>> data.show()
+---+---+---+---+------------------+
|  a|  b|  c|  d|                 e|
+---+---+---+---+------------------+
|  1|  5| 10|  0|[1, 2, 3, 4, 5, 6]|
|  0| 10| 20|  1|[2, 3, 4, 5, 6, 7]|
|  1| 15| 25|  0|[3, 4, 5, 6, 7, 8]|
|  0| 30| 40|  1|[4, 5, 6, 7, 8, 9]|
+---+---+---+---+------------------+
# columns that should be kept in the result
keep_cols = ["a","b"]
# column 'e' should be split into split_e_cols
split_e_cols = ["one","two","three","four","five","six"]
# I hope the result DataFrame has keep_cols + split_e_cols

I want to split column e into multiple columns while keeping columns a and b.

I tried:

data.select(*(col("e").getItem(i).alias(split_e_cols[i]) for i in range(len(split_e_cols))))

data.select("e").rdd.flatMap(lambda x:x).toDF(split_e_cols)

Neither attempt keeps columns a and b.
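To see why the second attempt loses a and b, here is a rough plain-Python model of it (not Spark code). It assumes each row left after select("e") behaves like a one-element tuple holding the array, so flattening the rows yields only the arrays. The names rows_after_select_e, flattened and records are made up for this sketch.

```python
from itertools import chain

# Stand-in for data.select("e").rdd: each row is a 1-tuple holding only the array.
# Columns a and b are already gone at this point because only "e" was selected.
rows_after_select_e = [
    ([1, 2, 3, 4, 5, 6],),
    ([2, 3, 4, 5, 6, 7],),
]

# Rough equivalent of flatMap(lambda x: x): emit every element of every row.
# Each row has a single element, so the result is just the list of arrays.
flattened = list(chain.from_iterable(rows_after_select_e))

# Rough equivalent of toDF(split_e_cols): name the six values of each array.
split_e_cols = ["one", "two", "three", "four", "five", "six"]
records = [dict(zip(split_e_cols, values)) for values in flattened]

print(records[0])
```

Nothing in this pipeline ever sees a or b, so they have to be carried along in the same select as the split columns.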

Can anyone help me? Thanks.

【Comments】:

    Tags: apache-spark pyspark apache-spark-sql


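    【Solution 1】:

    Try this:

    from pyspark.sql.functions import col

    # keep the wanted columns, then add one aliased column per array position
    select_cols = [col(c) for c in keep_cols] + [col("e").getItem(i).alias(split_e_cols[i]) for i in range(len(split_e_cols))]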
    
    data.select(*select_cols).show()
    
    #+---+---+---+---+-----+----+----+---+
    #|  a|  b|one|two|three|four|five|six|
    #+---+---+---+---+-----+----+----+---+
    #|  1|  5|  1|  2|    3|   4|   5|  6|
    #|  0| 10|  2|  3|    4|   5|   6|  7|
    #|  1| 15|  3|  4|    5|   6|   7|  8|
    #|  0| 30|  4|  5|    6|   7|   8|  9|
    #+---+---+---+---+-----+----+----+---+
    

    Or use a for loop with withColumn:

    data = data.select(keep_cols + ["e"])
    
    for i in range(len(split_e_cols)):
        data = data.withColumn(split_e_cols[i], col("e").getItem(i))
    
    data.drop("e").show()
    

    【Discussion】:

      【Solution 2】:

      You can concatenate the lists with +:

      from pyspark.sql.functions import col
      
      data.select(
          keep_cols +
          [col("e").getItem(i).alias(split_e_cols[i]) for i in range(len(split_e_cols))]
      ).show()
      +---+---+---+---+-----+----+----+---+
      |  a|  b|one|two|three|four|five|six|
      +---+---+---+---+-----+----+----+---+
      |  1|  5|  1|  2|    3|   4|   5|  6|
      |  0| 10|  2|  3|    4|   5|   6|  7|
      |  1| 15|  3|  4|    5|   6|   7|  8|
      |  0| 30|  4|  5|    6|   7|   8|  9|
      +---+---+---+---+-----+----+----+---+
      

      A more Pythonic approach is to use enumerate instead of range(len()):

      from pyspark.sql.functions import col
      
      data.select(
          keep_cols +
          [col("e").getItem(i).alias(c) for (i, c) in enumerate(split_e_cols)]
      ).show()
      +---+---+---+---+-----+----+----+---+
      |  a|  b|one|two|three|four|five|six|
      +---+---+---+---+-----+----+----+---+
      |  1|  5|  1|  2|    3|   4|   5|  6|
      |  0| 10|  2|  3|    4|   5|   6|  7|
      |  1| 15|  3|  4|    5|   6|   7|  8|
      |  0| 30|  4|  5|    6|   7|   8|  9|
      +---+---+---+---+-----+----+----+---+
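As a small plain-Python check of the point about enumerate, the sketch below builds the (position, name) pairs both ways and compares them. It involves no Spark calls; pairs_by_index and pairs_by_enumerate are names chosen only for this illustration.

```python
split_e_cols = ["one", "two", "three", "four", "five", "six"]

# Index-based pairing, as in the range(len()) version.
pairs_by_index = [(i, split_e_cols[i]) for i in range(len(split_e_cols))]

# enumerate yields the same (position, name) pairs without manual indexing.
pairs_by_enumerate = list(enumerate(split_e_cols))

print(pairs_by_index == pairs_by_enumerate)  # prints True
```

Both forms feed the same index and alias into getItem and alias, so the choice between them is purely about readability.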
      

      【Discussion】:
