Error: incompatible column types when using unionByName

【Question title】: Error incompatible column type when using unionByName
【Posted】: 2022-01-26 08:10:24
【Question】:

I am new to Spark SQL (using Scala) and have a basic question about an error I am facing. I am merging two DataFrames (oldData and newData) as shown below:

if (!oldData.isEmpty) {
  oldData
    .join(newData, Seq("internalUUID"), "left_anti")
    .unionByName(newData)
    .drop("all") //Drop records that have null in all fields
} else {
  newData
}

The error I see is:

org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ....
 at the 8th column of the second table;;
'Union
:- Project [internalUUID#342, TenantID#339, ObjectName#340, DataSource#341, product#343, plant#344, isMarkedForDeletion#345, distributionProfile#346, productionAspect#347, salesPlant#348, listing#349]
:  +- Join LeftAnti, (internalUUID#342 = internalUUID#300)
:     :- Relation[TenantID#339,ObjectName#340,DataSource#341,internalUUID#342,product#343,plant#344,isMarkedForDeletion#345,distributionProfile#346,productionAspect#347,salesPlant#348,listing#349] parquet
:     +- LogicalRDD [DataSource#296, ObjectName#297, TenantID#298, distributionProfile#299, internalUUID#300, isMarkedForDeletion#301, listing#302, plant#303, product#304, productionAspect#305, salesPlant#306], false
+- Project [internalUUID#300, TenantID#298, ObjectName#297, DataSource#296, product#304, plant#303, isMarkedForDeletion#301, distributionProfile#299, productionAspect#305, salesPlant#306, listing#302]
   +- LogicalRDD [DataSource#296, ObjectName#297, TenantID#298, distributionProfile#299, internalUUID#300, isMarkedForDeletion#301, listing#302, plant#303, product#304, productionAspect#305, salesPlant#306], false

The schemas are as follows. oldData:

root
 |-- TenantID: string (nullable = true)
 |-- ObjectName: string (nullable = true)
 |-- DataSource: string (nullable = true)
 |-- internalUUID: string (nullable = true)
 |-- product: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- internalRefUUID: string (nullable = true)
 |-- plant: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- internalRefUUID: string (nullable = true)
 |-- isMarkedForDeletion: boolean (nullable = true)
 |-- distributionProfile: struct (nullable = true)
 |    |-- code: string (nullable = true)
 |    |-- internalRefUUID: string (nullable = true)
 |-- productionAspect: struct (nullable = true)
 |    |-- productMovementPlants: struct (nullable = true)
 |    |    |-- unitOfIssue: struct (nullable = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |    |-- productPlanningPlants: struct (nullable = true)
 |    |    |-- goodsReceiptProcessDuration: long (nullable = true)
 |    |    |-- goodsIssueProcessDuration: long (nullable = true)
 |    |    |-- mrpType: struct (nullable = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |    |    |-- mrpController: struct (nullable = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |    |    |-- sourceOfSupplyCategory: struct (nullable = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |    |    |-- abcIndicator: struct (nullable = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |-- salesPlant: struct (nullable = true)
 |    |-- loadingGroup: struct (nullable = true)
 |    |    |-- code: string (nullable = true)
 |    |    |-- internalRefUUID: string (nullable = true)
 |-- listing: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- validFrom: string (nullable = true)
 |    |    |-- validTo: string (nullable = true)
 |    |    |-- isListed: boolean (nullable = true)

and newData:

root
 |-- DataSource: string (nullable = true)
 |-- ObjectName: string (nullable = true)
 |-- TenantID: string (nullable = true)
 |-- distributionProfile: struct (nullable = true)
 |    |-- code: string (nullable = true)
 |    |-- internalRefUUID: string (nullable = true)
 |-- internalUUID: string (nullable = true)
 |-- isMarkedForDeletion: boolean (nullable = true)
 |-- listing: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- isListed: boolean (nullable = true)
 |    |    |-- validFrom: string (nullable = true)
 |    |    |-- validTo: string (nullable = true)
 |-- plant: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- internalRefUUID: string (nullable = true)
 |-- product: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- internalRefUUID: string (nullable = true)
 |-- productionAspect: struct (nullable = true)
 |    |-- productMovementPlants: struct (nullable = true)
 |    |    |-- unitOfIssue: struct (nullable = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |    |-- productPlanningPlants: struct (nullable = true)
 |    |    |-- abcIndicator: struct (nullable = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |    |    |-- goodsIssueProcessDuration: long (nullable = true)
 |    |    |-- goodsReceiptProcessDuration: long (nullable = true)
 |    |    |-- mrpController: struct (nullable = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |    |    |-- mrpType: struct (nullable = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |    |    |-- sourceOfSupplyCategory: struct (nullable = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |-- salesPlant: struct (nullable = true)
 |    |-- loadingGroup: struct (nullable = true)
 |    |    |-- code: string (nullable = true)
 |    |    |-- internalRefUUID: string (nullable = true)

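However, I am not sure what "the 8th column of the second table" means. Also, the columns are ordered differently in the two DataFrames. Is there any guidance on how to proceed?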

【Comments】:

Tags: scala apache-spark apache-spark-sql

【Solution 1】:

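With unionByName, the order of the columns does not matter, because columns are resolved by name. However, this applies only to root columns (the ones returned by df.columns), not to nested fields.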

In your case, you get this error because some column types do not match between the two DataFrames.
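To find out which root columns are affected without relying on the ordinal in the error message, the two schemas can be compared by name. The following is a minimal sketch: `mismatchedColumns` is a hypothetical helper, not part of Spark, and the two small schemas are cut-down stand-ins for the ones in the question. It compares types with `!=`, which also looks at nullability flags, so it may be stricter than Spark's own union check.

```scala
import org.apache.spark.sql.types.{ArrayType, BooleanType, StringType, StructType}

// Hypothetical helper (not a Spark API): names of the root columns that exist
// in both schemas but whose data types are not equal.
def mismatchedColumns(left: StructType, right: StructType): Seq[String] =
  left.fields.toSeq.collect {
    case f if right.fieldNames.contains(f.name) && right(f.name).dataType != f.dataType =>
      f.name
  }

// Cut-down versions of the schemas from the question (assumption: only two columns).
val oldSchema = new StructType()
  .add("internalUUID", StringType)
  .add("listing", ArrayType(new StructType()
    .add("validFrom", StringType)
    .add("validTo", StringType)
    .add("isListed", BooleanType)))

val newSchema = new StructType()
  .add("listing", ArrayType(new StructType()
    .add("isListed", BooleanType)
    .add("validFrom", StringType)
    .add("validTo", StringType)))
  .add("internalUUID", StringType)

// The root column order differs, but only "listing" has a different type.
val mismatched = mismatchedColumns(oldSchema, newSchema)
```

On the real data this would be called as `mismatchedColumns(oldData.schema, newData.schema)`. Reading the printed schemas in the question by eye, the candidates appear to be listing and productionAspect (the fields under productPlanningPlants are in a different order).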

Take the listing column as an example:

newData => array<struct<isListed:boolean,validFrom:string,validTo:string>>

oldData => array<struct<validFrom:string,validTo:string,isListed:boolean>>

In a StructType, both the order and the types of the fields matter. The following simple code shows it:

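import org.apache.spark.sql.types.StructType

// Element struct of oldData.listing: validFrom, validTo, isListed
val oldListing = new StructType().add("validFrom", "string").add("validTo", "string").add("isListed", "boolean")
// Element struct of newData.listing: isListed, validFrom, validTo
val newListing = new StructType().add("isListed", "boolean").add("validFrom", "string").add("validTo", "string")

oldListing == newListing
//res239: Boolean = false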
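One possible way to make the nested types line up before the union is sketched below. This is not something the answer above prescribes; it is a workaround that assumes the nested values survive a JSON round trip (strings, longs and booleans, as in the question). The helper name `alignNested` is hypothetical. It serializes each struct or array-of-struct column with `to_json` and parses it back with `from_json` using the target type, which matches nested fields by name rather than by position. It adds serialization cost, so it is a sketch for illustration rather than a tuned solution.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json, to_json}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Hypothetical helper: rebuild each struct / array-of-struct root column of `df`
// with the type declared in `target`, matching nested fields by name via JSON.
def alignNested(df: DataFrame, target: StructType): DataFrame =
  target.fields.foldLeft(df) { (acc, field) =>
    field.dataType match {
      case _: StructType | ArrayType(_: StructType, _) =>
        acc.withColumn(field.name, from_json(to_json(col(field.name)), field.dataType))
      case _ => acc // plain columns are left untouched
    }
  }

// Tiny stand-in data (assumption: two columns only), built with SQL literals.
val spark = SparkSession.builder().master("local[1]").appName("align-nested").getOrCreate()

val oldDf = spark.sql(
  "SELECT 'a' AS internalUUID, " +
  "array(named_struct('validFrom', '2022-01-01', 'validTo', '2022-12-31', 'isListed', true)) AS listing")

val newDf = spark.sql(
  "SELECT 'b' AS internalUUID, " +
  "array(named_struct('isListed', false, 'validFrom', '2022-02-01', 'validTo', '2022-06-30')) AS listing")

// After alignment the nested field order follows oldDf, so the union is accepted.
val aligned = alignNested(newDf, oldDf.schema)
val merged = oldDf.unionByName(aligned)
```

With the DataFrames from the question, the union step would then read `.unionByName(alignNested(newData, oldData.schema))`. A plain `cast` to the target struct type is deliberately not used here, because casting between struct types maps fields by position, not by name.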

【Comments】: