【问题标题】:Why accesing DataFrame from UDF results in NullPointerException?为什么从 UDF 访问 DataFrame 会导致 NullPointerException?
【发布时间】:2018-04-17 02:53:24
【问题描述】:

我在执行 Spark 应用程序时遇到问题。

源代码:

// Read table From HDFS
val productInformation = spark.table("temp.temp_table1")
val dict = spark.table("temp.temp_table2")

// Custom UDF
val countPositiveSimilarity = udf[Long, Seq[String], Seq[String]]((a, b) => 
    dict.filter(
        (($"first".isin(a: _*) && $"second".isin(b: _*)) || ($"first".isin(b: _*) && $"second".isin(a: _*))) && $"similarity" > 0.7
    ).count
)

val result = productInformation.withColumn("positive_count", countPositiveSimilarity($"title", $"internal_category"))

// Error occurs!
result.show

错误信息:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 54.0 failed 4 times, most recent failure: Lost task 0.3 in stage 54.0 (TID 5887, ip-10-211-220-33.ap-northeast-2.compute.internal, executor 150): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (array<string>, array<string>) => bigint)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at $anonfun$1.apply(<console>:45)
    at $anonfun$1.apply(<console>:43)
    ... 16 more

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
  at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:636)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:595)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:604)
  ... 48 elided
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (array<string>, array<string>) => bigint)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  ... 3 more
Caused by: java.lang.NullPointerException
  at $anonfun$1.apply(<console>:45)
  at $anonfun$1.apply(<console>:43)
  ... 16 more

我检查了productInformationdictColumns 中是否有空值。但是没有空值。

谁能帮帮我? 我附上了示例代码,让您了解更多细节:

case class Target(wordListOne: Seq[String], WordListTwo: Seq[String])
val targetData = Seq(Target(Seq("Spark", "Wrong", "Something"), Seq("Java", "Grape", "Banana")),
                     Target(Seq("Java", "Scala"), Seq("Scala", "Banana")),
                     Target(Seq(""), Seq("Grape", "Banana")),
                     Target(Seq(""), Seq("")))
val targets = spark.createDataset(targetData)

case class WordSimilarity(first: String, second: String, similarity: Double)
val similarityData = Seq(WordSimilarity("Spark", "Java", 0.8), 
                     WordSimilarity("Scala", "Spark", 0.9), 
                     WordSimilarity("Java", "Scala", 0.9),
                     WordSimilarity("Apple", "Grape", 0.66),
                     WordSimilarity("Scala", "Apple", -0.1),
                     WordSimilarity("Gine", "Spark", 0.1)) 
val dict = spark.createDataset(similarityData)

val countPositiveSimilarity = udf[Long, Seq[String], Seq[String]]((a, b) => 
    dict.filter(
        (($"first".isin(a: _*) && $"second".isin(b: _*)) || ($"first".isin(b: _*) && $"second".isin(a: _*))) && $"similarity" > 0.7
    ).count
)

val countDF = targets.withColumn("positive_count", countPositiveSimilarity($"wordListOne", $"wordListTwo"))

这是一个示例代码,与我的原始代码类似。 示例代码运行良好。我应该检查原始代码和数据的哪一点?

【问题讨论】:

    标签: scala apache-spark


    【解决方案1】:

    非常有趣的问题。我必须做一些搜索,这是我的。希望对您有所帮助。

    当您通过createDataset 创建Dataset 时,spark 将为该数据集分配LocalRelation 逻辑查询计划。

    def createDataset[T : Encoder](data: Seq[T]): Dataset[T] = {
        val enc = encoderFor[T]
        val attributes = enc.schema.toAttributes
        val encoded = data.map(d => enc.toRow(d).copy())
        val plan = new LocalRelation(attributes, encoded)
        Dataset[T](self, plan)
      }
    

    关注linkLocalRelation is a leaf logical plan that allow functions like collect or take to be executed locally, i.e. without using Spark executors.

    而且,正如isLocal 方法指出的那样,这是真的

     /**
       * Returns true if the `collect` and `take` methods can be run locally
       * (without any Spark executors).
       *
       * @group basic
       * @since 1.6.0
       */
      def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
    

    显然,您可以检查您的 2 个数据集是本地的。

    而且,show 方法实际上在内部调用了take

    private[sql] def showString(_numRows: Int, truncate: Int = 20): String = {
        val numRows = _numRows.max(0)
        val takeResult = toDF().take(numRows + 1)
        val hasMoreData = takeResult.length > numRows
        val data = takeResult.take(numRows)
    

    因此,在这些情况下,我认为调用 countDF.show 已执行,它的行为类似于您在 driverdict 数据集上调用 count 时,调用次数为targets 的记录数。而且,dict 数据集当然不需要在 countDF 工作的本地显示。

    您可以尝试保存countDF,它会给您与第一种情况相同的异常 org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (array&lt;string&gt;, array&lt;string&gt;) =&gt; bigint)

    【讨论】:

    • 感谢您的回答。我有两个问题。 1)您的主要观点是,换句话说,当我执行保存或显示方法时,示例代码中会出现错误?我是入门级 Spark 用户,所以我认为这可能是一个愚蠢的问题。 2)除此之外,还有没有办法在udf中使用Dataframe?
    • 1.你的第一个案例,两者都会失败。您的第二种情况,保存将失败。 2.数据集可以是udf,即编译就好。但是,如果调用数据集不是本地的并且操作不是takecollect,它将在运行时失败,就像您的第一种情况一样。
    【解决方案2】:

    您不能在udf 中使用Dataframe。你需要加入productInformationdict,加入后做udf逻辑。

    【讨论】:

    • 是否可以在UDF中使用由dict(DataFrame)转换的不可变Map?目前,productInformation和dict很难加入。
    • 如果dict df 不是太大,您可以收集它,转换为Map,然后在udf 中使用该映射
    猜你喜欢
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 2016-08-15
    • 1970-01-01
    • 1970-01-01
    • 2011-01-27
    • 2012-01-30
    • 1970-01-01
    相关资源
    最近更新 更多