【Question title】: Writing Spark DataFrame as ORC file throws error
【Posted】: 2019-01-17 06:51:00
【Question】:

I am trying to write a Spark DataFrame as an ORC file, and the write fails with the error below. The underlying cause reported in the log is an IndexOutOfBoundsException.

Log:

Caused by: org.apache.spark.SparkException: Task failed while writing rows
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:270)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:189)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:188)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        ... 1 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 116, Size: 116
        at java.util.ArrayList.rangeCheck(ArrayList.java:657)
        at java.util.ArrayList.get(ArrayList.java:433)
        at org.apache.hadoop.hive.ql.io.orc.OrcStruct$OrcStructInspector.<init>(OrcStruct.java:196)
        at org.apache.hadoop.hive.ql.io.orc.OrcStruct.createObjectInspector(OrcStruct.java:549)
        at org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:109)
        at org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:188)
        at org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:231)
        at org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:91)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.org$apache$spark$sql$execution$datasources$FileFormatWriter$DynamicPartitionWriteTask$$newOutputWriter(FileFormatWriter.scala:416)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$2.apply(FileFormatWriter.scala:449)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$2.apply(FileFormatWriter.scala:438)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at org.apache.spark.sql.catalyst.util.AbstractScalaRowIterator.foreach(AbstractScalaRowIterator.scala:26)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:438)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:254)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1371)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:259)
        ... 8 more

【Question comments】:

    Tags: pyspark apache-spark-sql orc


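    【Answer 1】:

    Could you add more details about how you are trying to write to ORC?

    As a general approach, if you are reading data that already has a schema, for example a Hive table stored in text format, you can use the direct API as follows:

    df.write.format('orc').save('/tmp/output')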
    

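    If you do not have a schema, as is the case when reading data directly from HDFS or from a streaming application, you have to define the schema yourself and create the DataFrame with it.

    from pyspark.sql.types import StructType, StructField, StringType

    # Define the schema first, then pass it to the reader
    schema = StructType([
        StructField('colName1', StringType(), False)
    ])
    df = spark.read.csv(path, schema=schema)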
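For a source with more than one column, the same idea can be sketched as below. This is only an illustration under stated assumptions: the column list `COLUMNS`, the helper names `schema_dict` and `csv_to_orc`, and the paths passed in are all hypothetical. The schema is described as plain data and converted with `StructType.fromJson`, which is one of several ways to build a `StructType`.

```python
# Hypothetical layout of a headerless CSV: (column name, Spark type name, nullable).
COLUMNS = [
    ("colName1", "string", False),
    ("colName2", "integer", True),
]


def schema_dict(columns):
    """Describe the schema in the JSON layout that StructType.fromJson accepts."""
    return {
        "type": "struct",
        "fields": [
            {"name": name, "type": type_name, "nullable": nullable, "metadata": {}}
            for name, type_name, nullable in columns
        ],
    }


def csv_to_orc(spark, in_path, out_path, columns=COLUMNS):
    """Read a headerless CSV with an explicit schema and write it out as ORC."""
    # Imported lazily so schema_dict can be used without Spark installed.
    from pyspark.sql.types import StructType

    schema = StructType.fromJson(schema_dict(columns))
    df = spark.read.csv(in_path, schema=schema)
    df.write.format("orc").save(out_path)
    return df
```

With an existing SparkSession this would be called as `csv_to_orc(spark, in_path, out_path)`, where both paths are supplied by the caller.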
    

    Or, if you have an RDD, you have to convert the RDD[Any] into an RDD[Row], define a schema, and turn it into a DataFrame.

    df = spark.createDataFrame(rdd_of_rows, schema)
    df.write.format('orc').save('/tmp/output')
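The conversion step can be sketched as follows, assuming the raw RDD holds dicts, lists, or tuples. The helper names `to_row_values` and `rdd_to_orc` are hypothetical. Plain tuples are used here in place of `Row` objects, which `createDataFrame` also accepts when an explicit schema is supplied; values are matched to schema fields by position.

```python
def to_row_values(record, field_names):
    """Normalize one raw record to a tuple ordered like the schema fields."""
    if isinstance(record, dict):
        # Missing keys become None, so those columns must be nullable.
        return tuple(record.get(name) for name in field_names)
    values = tuple(record)
    if len(values) != len(field_names):
        raise ValueError(
            "expected %d values, got %d" % (len(field_names), len(values))
        )
    return values


def rdd_to_orc(spark, rdd, schema, out_path):
    """Convert an RDD of raw records to a DataFrame and write it as ORC."""
    names = schema.fieldNames()  # column order defined by the StructType
    rows = rdd.map(lambda record: to_row_values(record, names))
    df = spark.createDataFrame(rows, schema)
    df.write.format("orc").save(out_path)
    return df
```

Because `to_row_values` is plain Python, it can be checked on sample records before running the job on a cluster.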
    

    【Comments】:

    • Minor correction: on the PySpark side it should be df.write.format('orc').save(output_path); write is not a method.