Pyspark: explode columns to new dataframe

【Question title】: Pyspark: explode columns to new dataframe
【Posted】: 2020-04-23 17:45:12
【Question】:

I have a PySpark DataFrame with this schema:

 |-- doc_id: string (nullable = true)     
 |-- msp_contracts: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _el1: string (nullable = true)
 |    |    |-- _el2: long (nullable = true)
 |    |    |-- _el3: string (nullable = true)
 |    |    |-- _el4: string (nullable = true)
 |    |    |-- _el5: string (nullable = true)

How can I get this DataFrame from it?

|-- doc_id: string (nullable = true)
|-- _el1: string (nullable = true)
|-- _el3: string (nullable = true)
|-- _el4: string (nullable = true)
|-- _el5: string (nullable = true)

I tried this select:

explode('msp_contracts').select(
 col(u'msp_contracts.element._el1'),
 col(u'msp_contracts.element._el2')
)

But I get this error:

'Column' object is not callable

【Question comments】:

Try: df.selectExpr("inline_outer(msp_contracts)").drop("_VALUE", "_el2").show()

Tags: python pyspark

【Solution 1】:

After explode('msp_contracts'), Spark names the resulting column col if no alias is provided.

df.select("doc_id",explode("msp_contracts")).show()
#+------+---+
#|doc_id|col|
#+------+---+
#|     1|[1]|
#+------+---+

To select _el1 through col, try df_1.select("doc_id",explode("msp_contracts")).select("doc_id",col(u"col._el1")).show()

Example:

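from pyspark.sql.functions import col, explode

# spark and sc are the SparkSession and SparkContext that the pyspark shell provides
jsn='{"doc_id":1,"msp_contracts":[{"_el1":1}]}'
df=spark.read.json(sc.parallelize([jsn]))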

#schema
#root
# |-- doc_id: long (nullable = true)
# |-- msp_contracts: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- _el1: long (nullable = true)

df.withColumn("msp_contracts",explode(col("msp_contracts"))).\
select("doc_id","msp_contracts._el1").show()
#+------+----+
#|doc_id|_el1|
#+------+----+
#|     1|   1|
#+------+----+

UPDATE:

df.select("doc_id",explode("msp_contracts")).\
select("doc_id","col._el1").\
show()
#or
df.select("doc_id",explode("msp_contracts")).\
select("doc_id",col(u"col._el1")).\
show()
#+------+----+
#|doc_id|_el1|
#+------+----+
#|     1|   1|
#+------+----+

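【Comments】:

This doesn't work for me. I need to select into another DataFrame. When I use your code for the select, I get an error: cannot resolve 'msp_contracts.el1
select() does accept col(); not sure what you mean.
I have df_1 and need to build df_2 from some of df_1's columns. When I try col('doc_id'), explode(col('msp_contracts')).select( 'msp_contracts.element.el1', 'msp_contracts.element.el2' ), I get the error: 'msp_contracts.element.el2' TypeError: 'Column' object is not callable
@АнтонБукреев, please check my updated answer!
@CPak, I overlooked that .select accepts both col("<col_name>") and "<col_name>".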

【Solution 2】:

This worked for me:

df.select("doc_id",explode("msp_contracts")).\
   select("doc_id","col._el1")

With aliases and a custom column:

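# now_f is the answer author's own helper (not shown); it must return a Column
df.select(
        'doc_id',
        explode('msp_contracts').alias("msp_contracts")
        )\
        .select(
            'doc_id',
            # field names follow the question's schema (_el1, _el2)
            col('msp_contracts._el1').alias('last_period_44fz_customer'),
            col('msp_contracts._el2').alias('last_period_44fz_customer_inn')
        )\
        .withColumn("load_dtm", now_f())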

【Comments】: