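[Posted]: 2019-12-01 22:24:13
[Question]:
I have a data frame with 6 columns. I need to assign the values of one column to another: the values in the ROW column need to go into the ItemData column. The columns involved here are struct types, not plain strings.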
+-----+--------------------+--------------------+-------------------+--------------------------+--------------------+
|index| ROW| Document |ItemData | noNamespaceSchemaLocation| _xsi|
+-----+--------------------+--------------------+-------------------+--------------------------+--------------------+
| 0|[1,1,1018,17.0... |[[,2001-12-17T09:...| [,,,,,]| GetItemMasterSupp...|http://www.w3.org...|
+-----+--------------------+--------------------+-------------------+--------------------------+--------------------+
I tried registering the DF as a temporary table and then swapping the columns, but that did not help.
The final output should look like this:
+--------------------+-------------------+--------------------------+--------------------+
| Document |ItemData | noNamespaceSchemaLocation| _xsi|
+--------------------+-------------------+--------------------------+--------------------+
|[[,2001-12-17T09:...| [1,1,1018,17.0...| GetItemMasterSupp...|http://www.w3.org...|
+--------------------+-------------------+--------------------------+--------------------+
This is the schema from df.printSchema():
root
|-- index: long (nullable = false)
|-- ROW: struct (nullable = true)
| |-- CLTRP: long (nullable = true)
| |-- CORP: long (nullable = true)
| |-- CORP_ITEM_CD: long (nullable = true)
| |-- CTIV: double (nullable = true)
| |-- CTLFAC: string (nullable = true)
| |-- CTLI: long (nullable = true)
|-- DocData: struct (nullable = true)
| |-- Document: struct (nullable = true)
| | |-- AltementID: string (nullable = true)
| | |-- Creat: string (nullable = true)
| | |-- DataClasion: struct (nullable = true)
| | | |-- BusinessSeel: struct (nullable = true)
| | | | |-- Code: string (nullable = true)
| | | | |-- Description: string (nullable = true)
| | | |-- DataCLevel: struct (nullable = true)
| | | | |-- Code: string (nullable = true)
| | | | |-- Description: string (nullable = true)
| | | |-- PCaInd: string (nullable = true)
| | | |-- PHtaInd: string (nullable = true)
| | | |-- PPnd: string (nullable = true)
| |-- DocumentAction: struct (nullable = true)
| | |-- ActionTypeCd: string (nullable = true)
| | |-- RecordTypeCd: string (nullable = true)
|-- ItemData: struct (nullable = true)
| |-- CorpCd: string (nullable = true)
| |-- CorId: string (nullable = true)
| |-- DepId: string (nullable = true)
| |-- DisrId: string (nullable = true)
| |-- DivId: string (nullable = true)
| |-- WarId: string (nullable = true)
|-- _noNamespaceSchemaLocation: string (nullable = true)
|-- _xsi: string (nullable = true)
**Edit 1:** Updated to show how the data frame is created
// Imports needed by the snippet (spark-xml reader and the id function)
import com.databricks.spark.xml.XmlReader
import org.apache.spark.sql.functions.monotonically_increasing_id
// XML Data Reader
val supplyData = "Input_File/SCI_Input.xml"
val booksFileTag1 = "ROWSET"
val dataDF = (new XmlReader()).withRowTag(booksFileTag1).xmlFile(sqlContext, supplyData).toDF()
// Add an index column to join on
val dataFrame1 = dataDF.withColumn("index", monotonically_increasing_id())
// XML Schema Reader
val suppySchema = "Input_File/Supply_sample.xml"
val booksFileTag = "GetItemMaster"
val schemaDf = (new XmlReader()).withRowTag(booksFileTag).xmlFile(sqlContext, suppySchema).toDF()
val dataFrame2 = schemaDf.withColumn("index", monotonically_increasing_id())
// Combine the data and the schema template on the generated index
val finalDf = dataFrame1.join(dataFrame2, "index")
finalDf.show()
Output for reference for @jxc:
|-- ItemData: struct (nullable = true)
| |-- CLTRP: long (nullable = true)
| |-- CORP: long (nullable = true)
| |-- CORP_ITEM_CD: long (nullable = true)
| |-- CTIV: double (nullable = true)
| |-- CTLFAC: string (nullable = true)
| |-- CTLI: long (nullable = true)
[Comments]:
-
new_df = df.selectExpr('Document', 'ROW AS ItemData', 'noNamespaceSchemaLocation', '_xsi')
-
@user1638818 - Could you also post the creation statement for your data frame?
-
@vikrantrana -- I have edited my question to include the data frame creation part.
-
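@jxc -- Thanks for your answer. I tried what you suggested, but there is a problem with this approach: if I select 'ROW AS ItemData', the ItemData schema is also changed to that of the ROW column, which I don't want. I need to keep the ItemData schema as shown in my question (df.printSchema()); only the values should move into the ItemData column. After trying the code above, the ItemData schema looks like the output shown in the original question.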
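One possible way to get what the last comment describes (ItemData keeps its own field names and types, only the values come from ROW) is to rebuild the ItemData struct field by field. The following is only a sketch under a loud assumption: the question never says which ROW field maps to which ItemData field, so the mapping here is purely positional (first ROW field into first ItemData field, and so on). The object name `MoveRowIntoItemData`, the helper `copyValuesByPosition`, and the two-field toy data frame in `main` are all hypothetical and stand in for `finalDf`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.StructType

object MoveRowIntoItemData {

  // Rebuild `target` as a struct that keeps its own field names and types
  // but takes its values, position by position, from the `source` struct.
  // ASSUMPTION: the positional mapping is a guess; adjust it if the real
  // correspondence between the fields is different.
  def copyValuesByPosition(df: DataFrame, source: String, target: String): DataFrame = {
    val srcFields = df.schema(source).dataType.asInstanceOf[StructType].fields
    val tgtFields = df.schema(target).dataType.asInstanceOf[StructType].fields
    require(srcFields.length == tgtFields.length,
      s"$source and $target must have the same number of fields")

    // Take each source value, cast it to the target field's type,
    // and give it the target field's name.
    val rebuilt = srcFields.zip(tgtFields).map { case (s, t) =>
      col(source).getField(s.name).cast(t.dataType).as(t.name)
    }.toSeq

    // Overwrite the target column in place, then drop the source column.
    df.withColumn(target, struct(rebuilt: _*)).drop(source)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("move-row-into-itemdata-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for finalDf: two fields per struct instead of six.
    val df = Seq((0L, 1L, 17.0, "a", "b"))
      .toDF("index", "CLTRP", "CTIV", "CorpCd", "CorId")
      .select(
        col("index"),
        struct(col("CLTRP"), col("CTIV")).as("ROW"),
        struct(col("CorpCd"), col("CorId")).as("ItemData"))

    val moved = copyValuesByPosition(df, "ROW", "ItemData").drop("index")
    moved.printSchema()
    moved.show(false)

    spark.stop()
  }
}
```

With the data frame from the question, the call would presumably be `MoveRowIntoItemData.copyValuesByPosition(finalDf, "ROW", "ItemData").drop("index")`. Because every ItemData field is a string, the long and double values from ROW are cast to strings. The nullable flag on the rebuilt struct itself may differ from the original schema, so it is worth checking `printSchema()` on the result.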
Tags: scala apache-spark pyspark apache-spark-sql