[Question title]: "Incompatible column type" error when using unionByName
[Posted]: 2022-01-26 08:10:24
[Question description]:

I am new to Spark SQL (using Scala) and have a basic question about an error I am running into. I am merging two DataFrames (oldData and newData) as follows:

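if (!oldData.isEmpty) {
  oldData
    .join(newData, Seq("internalUUID"), "left_anti") // keep old rows that have no match in newData
    .unionByName(newData)
    .na.drop("all") // drop records that are null in all fields; plain .drop("all") would only drop a column named "all"
} else {
  newData
}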

The error I see is:

org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ....
 at the 8th column of the second table;;
'Union
:- Project [internalUUID#342, TenantID#339, ObjectName#340, DataSource#341, product#343, plant#344, isMarkedForDeletion#345, distributionProfile#346, productionAspect#347, salesPlant#348, listing#349]
:  +- Join LeftAnti, (internalUUID#342 = internalUUID#300)
:     :- Relation[TenantID#339,ObjectName#340,DataSource#341,internalUUID#342,product#343,plant#344,isMarkedForDeletion#345,distributionProfile#346,productionAspect#347,salesPlant#348,listing#349] parquet
:     +- LogicalRDD [DataSource#296, ObjectName#297, TenantID#298, distributionProfile#299, internalUUID#300, isMarkedForDeletion#301, listing#302, plant#303, product#304, productionAspect#305, salesPlant#306], false
+- Project [internalUUID#300, TenantID#298, ObjectName#297, DataSource#296, product#304, plant#303, isMarkedForDeletion#301, distributionProfile#299, productionAspect#305, salesPlant#306, listing#302]
   +- LogicalRDD [DataSource#296, ObjectName#297, TenantID#298, distributionProfile#299, internalUUID#300, isMarkedForDeletion#301, listing#302, plant#303, product#304, productionAspect#305, salesPlant#306], false

The schemas are as follows. oldData:

root
 |-- TenantID: string (nullable = true)
 |-- ObjectName: string (nullable = true)
 |-- DataSource: string (nullable = true)
 |-- internalUUID: string (nullable = true)
 |-- product: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- internalRefUUID: string (nullable = true)
 |-- plant: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- internalRefUUID: string (nullable = true)
 |-- isMarkedForDeletion: boolean (nullable = true)
 |-- distributionProfile: struct (nullable = true)
 |    |-- code: string (nullable = true)
 |    |-- internalRefUUID: string (nullable = true)
 |-- productionAspect: struct (nullable = true)
 |    |-- productMovementPlants: struct (nullable = true)
 |    |    |-- unitOfIssue: struct (nullable = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |    |-- productPlanningPlants: struct (nullable = true)
 |    |    |-- goodsReceiptProcessDuration: long (nullable = true)
 |    |    |-- goodsIssueProcessDuration: long (nullable = true)
 |    |    |-- mrpType: struct (nullable = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |    |    |-- mrpController: struct (nullable = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |    |    |-- sourceOfSupplyCategory: struct (nullable = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |    |    |-- abcIndicator: struct (nullable = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |-- salesPlant: struct (nullable = true)
 |    |-- loadingGroup: struct (nullable = true)
 |    |    |-- code: string (nullable = true)
 |    |    |-- internalRefUUID: string (nullable = true)
 |-- listing: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- validFrom: string (nullable = true)
 |    |    |-- validTo: string (nullable = true)
 |    |    |-- isListed: boolean (nullable = true)

and newData:

root
 |-- DataSource: string (nullable = true)
 |-- ObjectName: string (nullable = true)
 |-- TenantID: string (nullable = true)
 |-- distributionProfile: struct (nullable = true)
 |    |-- code: string (nullable = true)
 |    |-- internalRefUUID: string (nullable = true)
 |-- internalUUID: string (nullable = true)
 |-- isMarkedForDeletion: boolean (nullable = true)
 |-- listing: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- isListed: boolean (nullable = true)
 |    |    |-- validFrom: string (nullable = true)
 |    |    |-- validTo: string (nullable = true)
 |-- plant: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- internalRefUUID: string (nullable = true)
 |-- product: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- internalRefUUID: string (nullable = true)
 |-- productionAspect: struct (nullable = true)
 |    |-- productMovementPlants: struct (nullable = true)
 |    |    |-- unitOfIssue: struct (nullable = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |    |-- productPlanningPlants: struct (nullable = true)
 |    |    |-- abcIndicator: struct (nullable = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |    |    |-- goodsIssueProcessDuration: long (nullable = true)
 |    |    |-- goodsReceiptProcessDuration: long (nullable = true)
 |    |    |-- mrpController: struct (nullable = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |    |    |-- mrpType: struct (nullable = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |    |    |-- sourceOfSupplyCategory: struct (nullable = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- internalRefUUID: string (nullable = true)
 |-- salesPlant: struct (nullable = true)
 |    |-- loadingGroup: struct (nullable = true)
 |    |    |-- code: string (nullable = true)
 |    |    |-- internalRefUUID: string (nullable = true)

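However, I am not sure what "the 8th column of the second table" means. Also, the columns are ordered differently in the two DataFrames. Is there any guidance on how to proceed?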

    Tags: scala apache-spark apache-spark-sql


    [Solution 1]:

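    With unionByName, column order does not matter, because columns are resolved by name. However, this only applies to root columns (the ones returned by df.columns), not to nested fields.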

    In your case, you get this error because some column types do not match between the two DataFrames.
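
To find which columns are affected, the two schemas can be compared column by column. The following is a minimal sketch, not part of the original answer: `mismatchedColumns` is a hypothetical helper name, and the comparison is strict, so a difference in nullability alone also counts as a mismatch.

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Hypothetical helper: names of top-level columns that exist in both schemas
// but whose data types differ. Nested field order counts as a difference.
def mismatchedColumns(left: StructType, right: StructType): Seq[String] = {
  // Look up the right-hand type of each column by name, so top-level order is ignored
  val rightTypes = right.fields.map(f => f.name -> f.dataType).toMap
  left.fields.toSeq.collect {
    case StructField(name, dataType, _, _) if rightTypes.get(name).exists(_ != dataType) => name
  }
}

// Usage with the DataFrames from the question:
// mismatchedColumns(oldData.schema, newData.schema)
```

Going by the schemas printed in the question, this should report productionAspect and listing, since those are the two columns whose nested fields appear in a different order.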

    Take the listing column as an example:

    newData => array<struct<isListed:boolean,validFrom:string,validTo:string>>

    oldData => array<struct<validFrom:string,validTo:string,isListed:boolean>>

    In a StructType, both the order and the types of the fields matter. You can check this with a few lines of code:

    import org.apache.spark.sql.types.StructType

    // Same three fields, declared in the order each DataFrame uses
    val oldListing = new StructType().add("validFrom", "string").add("validTo", "string").add("isListed", "boolean")
    val newListing = new StructType().add("isListed", "boolean").add("validFrom", "string").add("validTo", "string")
    
    oldListing == newListing
    //res239: Boolean = false
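
One possible way to get past the error is to rebuild the nested columns of newData so that their fields follow the order used by oldData before calling unionByName. The sketch below is an illustration under stated assumptions, not the original answerer's method: `alignTo` and `alignSchema` are hypothetical helper names, both DataFrames are assumed to contain the same field names and differ only in order, and `functions.transform` with a Scala lambda requires Spark 3.0 or later.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, struct, transform, when}
import org.apache.spark.sql.types.{ArrayType, DataType, StructType}

// Hypothetical helper: rebuild column `c` so that its nested fields follow `target`.
// Assumes both sides have the same field names and differ only in field order.
def alignTo(c: Column, target: DataType): Column = target match {
  case st: StructType =>
    // Pick each field by name, in the target's order, recursing into nested types
    val rebuilt = struct(st.fields.map(f => alignTo(c.getField(f.name), f.dataType).as(f.name)): _*)
    when(c.isNotNull, rebuilt) // a null struct stays null instead of becoming a struct of nulls
  case ArrayType(elementType, _) =>
    transform(c, element => alignTo(element, elementType)) // Scala lambda form needs Spark 3.0+
  case _ => c // leaf types are left untouched
}

// Hypothetical helper: select the target's columns in the target's order and nested layout.
def alignSchema(df: DataFrame, target: StructType): DataFrame =
  df.select(target.fields.map(f => alignTo(col(f.name), f.dataType).as(f.name)): _*)

// Usage with the DataFrames from the question:
// val merged = oldData
//   .join(newData, Seq("internalUUID"), "left_anti")
//   .unionByName(alignSchema(newData, oldData.schema))
```

Fields are looked up by name rather than by position, so values end up under the field they belong to regardless of how the two schemas were ordered.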
    
