Spark: Ambiguous reference to fields

【Question title】: Spark: Ambiguous reference to fields
【Posted】: 2021-11-23 13:16:20
【Question】:

I get the following error when trying to flatten a highly nested structure:

org.apache.spark.sql.AnalysisException: Ambiguous reference to fields StructField(error,StructType(StructField(array,ArrayType(StructType(StructField(double,DoubleType,true), StructField(int,IntegerType,true), StructField(string,StringType,true)),true),true), StructField(double,DoubleType,true), StructField(int,IntegerType,true), StructField(string,StringType,true), StructField(struct,StructType(StructField(message,StringType,true), StructField(kind,StringType,true), StructField(stack,StringType,true)),true)),true), StructField(错误,StructType(StructField(array,ArrayType(StringType,true),true), StructField(string,StringType,true)),true)

I can't figure out what exactly is causing this. Apart from the deeply nested structure, where is the ambiguity?

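【Comments】:

Possible duplicate of stackoverflow.com/questions/66462194/… Take a look at the schema in the linked question. You probably have two fields with the same name at the same level. Also, when you run into a problem and post on SO, please include the schema and a sample of the dataframe.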
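One way to check the comment's suggestion is to walk the schema and report struct levels that hold same-named fields. The sketch below is only an illustration: `DuplicateFieldFinder` and `findCollisions` are hypothetical names, and it assumes names should be compared case-insensitively, because Spark's `spark.sql.caseSensitive` setting is `false` by default.

```scala
import java.util.Locale
import org.apache.spark.sql.types._

object DuplicateFieldFinder {
  // Returns (path, colliding names) for every struct level whose field names
  // clash when compared case-insensitively. An empty path means the top level.
  def findCollisions(dt: DataType, path: String = ""): Seq[(String, Seq[String])] = dt match {
    case st: StructType =>
      // Collisions among the fields directly inside this struct.
      val here = st.fieldNames.toSeq
        .groupBy(_.toLowerCase(Locale.ROOT))
        .values
        .filter(_.size > 1)
        .map(names => (path, names))
        .toSeq
      // Recurse into each child field, extending the dotted path.
      val nested = st.fields.toSeq.flatMap { f =>
        val childPath = if (path.isEmpty) f.name else s"$path.${f.name}"
        findCollisions(f.dataType, childPath)
      }
      here ++ nested
    case ArrayType(elementType, _) => findCollisions(elementType, path)
    case MapType(_, valueType, _)  => findCollisions(valueType, path)
    case _                         => Seq.empty
  }
}
```

Calling `DuplicateFieldFinder.findCollisions(df.schema)` on the dataframe being flattened lists the levels worth inspecting. An empty result means no same-level clash was found under this assumption.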

Tags: apache-spark pyspark

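【Solution 1】:

This happens when you join two dataframes and both contain a field with the same name. When you refer to the duplicated field, Spark does not know which column you mean. The fix is to rename the field on one side of the join. For example:

dfA is a dataframe with 2 columns => (id,name)
dfB is a dataframe with 3 columns => (id,name,description)

You join the two dataframes on the column "id" and want to select the "name" column from the second one: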

val dfJoined = dfA.join(dfB,Seq("id"),"inner").select("name")

Because a "name" column exists in both dataframes, Spark cannot tell which "name" you are asking for.

Solution:

val dfRenamedB = dfB.withColumnRenamed("name","b_name")

Now, when you join dfA with dfRenamedB, the result has both a "name" and a "b_name" column, so you can be sure which one you are selecting.
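Put together, the rename and the join might look like the sketch below. The object name, the helper `joinWithRenamedName`, the sample rows, and the local `SparkSession` settings are assumptions made for illustration. Only the `withColumnRenamed` and `join` calls come from the answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RenameBeforeJoin {
  // Renames dfB's "name" to "b_name" before joining on "id", so the joined
  // result has no two columns with the same name.
  def joinWithRenamedName(dfA: DataFrame, dfB: DataFrame): DataFrame = {
    val dfRenamedB = dfB.withColumnRenamed("name", "b_name")
    dfA.join(dfRenamedB, Seq("id"), "inner")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("rename-before-join")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows matching the (id,name) and (id,name,description) layouts.
    val dfA = Seq((1, "alice")).toDF("id", "name")
    val dfB = Seq((1, "bob", "second frame")).toDF("id", "name", "description")

    val dfJoined = joinWithRenamedName(dfA, dfB)
    // "name" now refers only to dfA's column and "b_name" to dfB's.
    dfJoined.select("name", "b_name").show()

    spark.stop()
  }
}
```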

【Discussion】:

Renaming the field is not strictly necessary. You can also do this: ` df = df.join(df1, ['', 'inner']) ` This keeps only one of the two columns and drops the other.
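Not quite. We are not talking about the field used for the join, but about other fields that exist in both dataframes. In your example, "" is the field used to join the two dataframes, so it appears only once in the resulting dataframe. However, if both dataframes share a field other than the join key, referring to it raises an error, because Spark does not know which of the two fields you are requesting.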
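The difference between the join key and any other shared column can be seen by inspecting the joined column names. This is a minimal sketch that assumes the (id,name) and (id,name,description) layouts from the answer. The object, the helper names, and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.Try

object JoinKeyVsSharedColumn {
  // Joins on "id" only; any other column that both frames share stays duplicated.
  def joinOnId(dfA: DataFrame, dfB: DataFrame): DataFrame =
    dfA.join(dfB, Seq("id"), "inner")

  // Column names that occur more than once in the frame.
  def duplicatedColumns(df: DataFrame): Seq[String] =
    df.columns.toSeq
      .groupBy(identity)
      .collect { case (name, hits) if hits.size > 1 => name }
      .toSeq

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("join-key-vs-shared-column")
      .getOrCreate()
    import spark.implicits._

    val dfA = Seq((1, "alice")).toDF("id", "name")
    val dfB = Seq((1, "bob", "second frame")).toDF("id", "name", "description")

    val joined = joinOnId(dfA, dfB)
    // The join key "id" is listed once, while "name" is listed twice.
    println(joined.columns.mkString(", "))
    println(s"duplicated: ${duplicatedColumns(joined).mkString(", ")}")

    // Selecting the duplicated name fails analysis; Try captures the exception.
    val attempt = Try(joined.select("name").columns)
    println(s"select(name) failed: ${attempt.isFailure}")

    spark.stop()
  }
}
```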