【Question title】: NPE while reading ORC file using Spark 1.4 API
【Posted】: 2015-12-04 08:47:39
【Question】:

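I read a large number of ORC files in Spark and process them; the files are essentially Hive partitions. Most of the time processing goes fine, but for a few files I get the exception below, and I don't know why. The same files work fine when queried through Hive.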

DataFrame df = hiveContext.read().format("orc").load("/path/in/hdfs");

java.lang.NullPointerException
    at org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:402)
    at org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:206)
    at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$8.apply(OrcRelation.scala:238)
    at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$8.apply(OrcRelation.scala:238)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:238)
    at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:290)
    at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:288)
    at org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

【Comments】:

    Tags: apache-spark hive apache-spark-sql orc


    【Answer 1】:

    A NullPointerException is always a bug. Not in your code, but in the Java program you are using. So file a bug with Apache Spark.

    【Comments】:

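      【Answer 2】:

      I hit the same error on Spark 1.6.1. I have not found the root cause yet, but my first finding is that only some Hive partitions return no data (even though they work and return data fine through Hive itself). This means that if you remove the partition filter, or query a different table, everything looks fine.

      【Comments】: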

        【Answer 3】:

        This error occurs when the data does not follow the directory structure Spark expects. Consider a table named partitionedtable that is partitioned on partitionCol1, partitionCol2, partitionCol3:

        hdfs dfs -ls /path/in/hdfs/partitionedtable/

        /path/in/hdfs/partitionedtable/partitionCol1=1/partitionCol2=11/partitionCol3=111/part-00000

        /path/in/hdfs/partitionedtable/partitionCol1=2/partitionCol2=22/partitionCol3=222/part-00001

        /path/in/hdfs/partitionedtable/partitionCol1=3/ --> this one does not contain any data.

        Reference: https://issues.apache.org/jira/browse/SPARK-10304
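To spot directories like the partitionCol1=3 one above before handing the path to Spark, something along these lines may help. It is only a minimal sketch in plain Java, not part of Spark or Hadoop: the class name PartitionLayoutCheck and the method findIncompletePartitions are made up for illustration, and it assumes you have already collected the recursive path listing (for example the path column of `hdfs dfs -ls -R`) into a list of strings. It also assumes that partition directories are the only path segments containing `=`, and that names starting with `_` or `.` are not data files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper (not a Spark/Hadoop API): flags partition directories
// that do not lead to a data file at the full partition depth.
class PartitionLayoutCheck {

    static List<String> findIncompletePartitions(
            String tableRoot, List<String> listing, int partitionDepth) {
        String root = tableRoot.endsWith("/") ? tableRoot : tableRoot + "/";
        Set<String> seen = new TreeSet<>();     // every partition directory in the listing
        Set<String> covered = new TreeSet<>();  // directories on the way to a data file

        for (String path : listing) {
            if (!path.startsWith(root)) {
                continue; // not under the table root
            }
            String[] segments = path.substring(root.length()).split("/");

            // Collect the leading "col=value" segments as cumulative prefixes.
            List<String> prefixes = new ArrayList<>();
            StringBuilder prefix = new StringBuilder();
            int depth = 0;
            while (depth < segments.length && segments[depth].contains("=")) {
                if (depth > 0) {
                    prefix.append('/');
                }
                prefix.append(segments[depth]);
                prefixes.add(prefix.toString());
                depth++;
            }
            seen.addAll(prefixes);

            // Assumption: a visible file directly below the last partition level is data.
            boolean hasDataFile = depth < segments.length
                    && !segments[depth].isEmpty()
                    && !segments[depth].startsWith("_")
                    && !segments[depth].startsWith(".");
            if (depth == partitionDepth && hasDataFile) {
                covered.addAll(prefixes);
            }
        }

        seen.removeAll(covered);
        List<String> incomplete = new ArrayList<>();
        for (String p : seen) {
            incomplete.add(root + p);
        }
        return incomplete;
    }

    public static void main(String[] args) {
        // The listing from this answer, typed in by hand.
        List<String> listing = Arrays.asList(
            "/path/in/hdfs/partitionedtable/partitionCol1=1/partitionCol2=11/partitionCol3=111/part-00000",
            "/path/in/hdfs/partitionedtable/partitionCol1=2/partitionCol2=22/partitionCol3=222/part-00001",
            "/path/in/hdfs/partitionedtable/partitionCol1=3/");
        System.out.println(
            findIncompletePartitions("/path/in/hdfs/partitionedtable", listing, 3));
    }
}
```

Any directory it reports is a candidate to remove, fill, or exclude from the load path. Whether that clears the NPE on your Spark version is something to verify against the JIRA ticket above.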

        【Comments】:
