【发布时间】:2021-03-01 18:40:27
【问题描述】:
我有这样的数据:
>>> data = sc.parallelize([[1,5,10,0,[1,2,3,4,5,6]],[0,10,20,1,[2,3,4,5,6,7]],[1,15,25,0,[3,4,5,6,7,8]],[0,30,40,1,[4,5,6,7,8,9]]]).toDF(('a','b','c',"d","e"))
>>> data.show()
+---+---+---+---+------------------+
| a| b| c| d| e|
+---+---+---+---+------------------+
| 1| 5| 10| 0|[1, 2, 3, 4, 5, 6]|
| 0| 10| 20| 1|[2, 3, 4, 5, 6, 7]|
| 1| 15| 25| 0|[3, 4, 5, 6, 7, 8]|
| 0| 30| 40| 1|[4, 5, 6, 7, 8, 9]|
+---+---+---+---+------------------+
# colums should be kept in result
keep_cols = ["a","b"]
# column 'e' should be split into split_e_cols
split_e_cols = ["one","two","three","four","five","six"]
# I hope the result dataframe has keep_cols + split_res_cols
我想将e 列拆分为多个列,同时保留a 和b 列。
我试过了:
data.select(*(col("e").getItem(i).alias(split_e_cols[i]) for i in range(len(len(split_e_cols)))))
和
data.select("e").rdd.flatMap(lambda x:x).toDF(split_e_cols)
两者都不能保留a 和b 列。
谁能帮助我?谢谢。
【问题讨论】:
标签: apache-spark pyspark apache-spark-sql