[Posted]: 2022-01-26 08:10:24
[Question]:
I am new to Spark SQL (using Scala) and have some basic questions about an error I am running into. I am merging two DataFrames (oldData and newData) as shown below:
if (!oldData.isEmpty) {
  oldData
    .join(newData, Seq("internalUUID"), "left_anti") // keep old rows that have no match in newData
    .unionByName(newData)
    .na.drop("all") // drop rows that are null in every column
} else {
  newData
}
The error I see is:
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ....
at the 8th column of the second table;;
'Union
:- Project [internalUUID#342, TenantID#339, ObjectName#340, DataSource#341, product#343, plant#344, isMarkedForDeletion#345, distributionProfile#346, productionAspect#347, salesPlant#348, listing#349]
:  +- Join LeftAnti, (internalUUID#342 = internalUUID#300)
:     :- Relation[TenantID#339,ObjectName#340,DataSource#341,internalUUID#342,product#343,plant#344,isMarkedForDeletion#345,distributionProfile#346,productionAspect#347,salesPlant#348,listing#349] parquet
:     +- LogicalRDD [DataSource#296, ObjectName#297, TenantID#298, distributionProfile#299, internalUUID#300, isMarkedForDeletion#301, listing#302, plant#303, product#304, productionAspect#305, salesPlant#306], false
+- Project [internalUUID#300, TenantID#298, ObjectName#297, DataSource#296, product#304, plant#303, isMarkedForDeletion#301, distributionProfile#299, productionAspect#305, salesPlant#306, listing#302]
   +- LogicalRDD [DataSource#296, ObjectName#297, TenantID#298, distributionProfile#299, internalUUID#300, isMarkedForDeletion#301, listing#302, plant#303, product#304, productionAspect#305, salesPlant#306], false
The schemas are as follows. oldData:
root
|-- TenantID: string (nullable = true)
|-- ObjectName: string (nullable = true)
|-- DataSource: string (nullable = true)
|-- internalUUID: string (nullable = true)
|-- product: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- plant: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- isMarkedForDeletion: boolean (nullable = true)
|-- distributionProfile: struct (nullable = true)
| |-- code: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- productionAspect: struct (nullable = true)
| |-- productMovementPlants: struct (nullable = true)
| | |-- unitOfIssue: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| |-- productPlanningPlants: struct (nullable = true)
| | |-- goodsReceiptProcessDuration: long (nullable = true)
| | |-- goodsIssueProcessDuration: long (nullable = true)
| | |-- mrpType: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- mrpController: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- sourceOfSupplyCategory: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- abcIndicator: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
|-- salesPlant: struct (nullable = true)
| |-- loadingGroup: struct (nullable = true)
| | |-- code: string (nullable = true)
| | |-- internalRefUUID: string (nullable = true)
|-- listing: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- validFrom: string (nullable = true)
| | |-- validTo: string (nullable = true)
| | |-- isListed: boolean (nullable = true)
and newData:
root
|-- DataSource: string (nullable = true)
|-- ObjectName: string (nullable = true)
|-- TenantID: string (nullable = true)
|-- distributionProfile: struct (nullable = true)
| |-- code: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- internalUUID: string (nullable = true)
|-- isMarkedForDeletion: boolean (nullable = true)
|-- listing: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- isListed: boolean (nullable = true)
| | |-- validFrom: string (nullable = true)
| | |-- validTo: string (nullable = true)
|-- plant: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- product: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- productionAspect: struct (nullable = true)
| |-- productMovementPlants: struct (nullable = true)
| | |-- unitOfIssue: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| |-- productPlanningPlants: struct (nullable = true)
| | |-- abcIndicator: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- goodsIssueProcessDuration: long (nullable = true)
| | |-- goodsReceiptProcessDuration: long (nullable = true)
| | |-- mrpController: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- mrpType: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- sourceOfSupplyCategory: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
|-- salesPlant: struct (nullable = true)
| |-- loadingGroup: struct (nullable = true)
| | |-- code: string (nullable = true)
| | |-- internalRefUUID: string (nullable = true)
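However, I am not sure what "the 8th column of the second table" means. Also, the columns are ordered differently in the two DataFrames. Is there any guidance on how to proceed?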
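For reference, a minimal sketch of how one might locate the offending column. It rests on two assumptions: that the ordinal in the message is counted from zero (so "8th" would be index 8 of the Project list, i.e. productionAspect), and that unionByName in the Spark version that produced this plan aligns only top-level columns by name, leaving nested struct fields in their original order. SchemaDiff, sortFields and mismatchedColumns are hypothetical names for illustration, not part of Spark.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper object; only the org.apache.spark.sql.types classes are real Spark API.
object SchemaDiff {

  // Recursively sort struct fields by name so that field order no longer
  // affects equality. Map types are left untouched in this sketch.
  def sortFields(dt: DataType): DataType = dt match {
    case s: StructType =>
      StructType(s.fields.sortBy(_.name).map(f => f.copy(dataType = sortFields(f.dataType))))
    case a: ArrayType =>
      a.copy(elementType = sortFields(a.elementType))
    case other => other
  }

  // Names of top-level columns present in both schemas whose data types are
  // not equal as-is. Nested field order, nullability and metadata all count.
  def mismatchedColumns(left: StructType, right: StructType): Seq[String] =
    left.fields.toSeq
      .filter(f => right.fieldNames.contains(f.name))
      .filter(f => right(f.name).dataType != f.dataType)
      .map(_.name)
}
```

Calling SchemaDiff.mismatchedColumns(oldData.schema, newData.schema) on the schemas above would be expected to list productionAspect and listing, the two columns whose nested fields appear in a different order. If SchemaDiff.sortFields(oldData.schema) == SchemaDiff.sortFields(newData.schema) holds, the two schemas differ in field ordering only.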
[Comments]:
Tags: scala apache-spark apache-spark-sql