【问题标题】:Getting "org.apache.spark.sql.AnalysisException" when creating Dataset from RDD从 RDD 创建数据集时获取“org.apache.spark.sql.AnalysisException”
【发布时间】:2023-03-25 19:54:01
【问题描述】:

我最近开始使用 Spark 的 Dataset API,我正在尝试一些示例。以下是一个这样的示例,它以 AnalysisException 失败。

case class Fruits(name: String, quantity: Int)

val source = Array(("mango", 1), ("Guava", 2), ("mango", 2), ("guava", 2))
val sourceDS = spark.createDataset(source).as[Fruits]
// or val sourceDS = spark.sparkContext.parallelize(source).toDS().as[Fruits]
val resultDS = sourceDS.filter(_.name == "mango").filter(_.quantity > 1)

执行上述代码时,我得到:

19/06/02 18:04:42 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
19/06/02 18:04:42 INFO CodeGenerator: Code generated in 405.026891 ms
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`name`' given input columns: [_1, _2];
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:110)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:107)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:278)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:278)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChild$2(TreeNode.scala:295)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$4(TreeNode.scala:354)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.TraversableLike.map(TraversableLike.scala:237)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
    at scala.collection.immutable.List.map(List.scala:298)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:354)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUp$1(QueryPlan.scala:93)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:105)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:105)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:116)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:126)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:126)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:93)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:107)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:85)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:85)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:82)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.resolveAndBind(ExpressionEncoder.scala:258)
    at org.apache.spark.sql.Dataset.deserializer$lzycompute(Dataset.scala:214)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$deserializer(Dataset.scala:213)
    at org.apache.spark.sql.Dataset$.apply(Dataset.scala:72)
    at org.apache.spark.sql.Dataset.as(Dataset.scala:431)
    at SocketStreamWordcountApp$.main(SocketStreamWordcountApp.scala:20)
    at SocketStreamWordcountApp.main(SocketStreamWordcountApp.scala)
19/06/02 18:04:43 INFO SparkContext: Invoking stop() from shutdown hook

我认为当我们尝试使用as[T] 创建新数据集或将 RDD 转换为数据集时,它会起作用。不是这样吗?

只是为了实验,我尝试创建一个 Dataframe 并将 Dataframe 转换为 Dataset,如下所示,但我最终遇到了同样的错误。

val sourceDS = spark.sparkContext.parallelize(source).toDF().as[Fruits]
// or val sourceDS = spark.createDataFrame(source).as[Fruits]

任何帮助将不胜感激。

【问题讨论】:

    标签: apache-spark rdd apache-spark-dataset


    【解决方案1】:

    输入DataFrame 的列名必须与案例类的字段名匹配。所以你要么需要中级Dataset[Row]

    val sourceDS = spark.createDataset(source).toDF("name", "quantity").as[Fruits]
    

    或者一路走。

    当然,合理的解决方案是从一开始就以Fruits 开头。

    val source = Array(Fruits("mango", 1), Fruits("Guava", 2), Fruits("mango", 2), Fruits("guava", 2))
    

    【讨论】:

      【解决方案2】:

      从 spark 2.3 开始,数据框的列名应与案例类参数的名称匹配。而对于以前的版本(2.1.1),唯一的约束是相同数量的列/参数。 您可以通过这种方式创建一个 Fruits 序列而不是元组:

      case class Fruits(name: String, quantity: Int)
      
      val source = Array(Fruits("mango", 1), Fruits("Guava", 2), Fruits("mango", 2), Fruits("guava", 2))
      val sourceDS = spark.createDataset(source)
      val resultDS = sourceDS.filter(_.name == "mango").filter(_.quantity
      

      【讨论】:

        【解决方案3】:

        我认为@user11589880 的答案可行,但我有一个替代方案供您考虑:

        val sourceDS = Seq(Fruit("Mango", 1), Fruit("Guava", 2)).toDF
        

        sourceDS 的类型将是 Dataset[Fruit]

        【讨论】:

          猜你喜欢
          • 2016-09-27
          • 2019-05-04
          • 1970-01-01
          • 2021-12-10
          • 2019-07-14
          • 2019-08-11
          • 2017-06-02
          • 1970-01-01
          • 1970-01-01
          相关资源
          最近更新 更多