【问题标题】:cannot load parquet file (Parquet type not supported: INT32 (UINT_8);) with pyspark无法使用 pyspark 加载镶木地板文件(不支持镶木地板类型:INT32 (UINT_8);)
【发布时间】:2021-01-30 14:44:16
【问题描述】:

我正在尝试加载存储在 hadoop 中的 parquet 文件。
这是我的桌子:

name   type
----------------
ID     BIGINT
point  SMALLINT
check  TINYINT

我要执行的是:

df = sqlContext.read.parquet('path')

我得到了这个错误:

Caused by: org.apache.spark.sql.AnalysisException: Parquet type not supported: INT32 (UINT_8);
    at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.typeNotSupported$1(ParquetSchemaConverter.scala:101)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:137)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:89)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter$$anonfun$1.apply(ParquetSchemaConverter.scala:68)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter$$anonfun$1.apply(ParquetSchemaConverter.scala:65)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetToSparkSchemaConverter$$convert(ParquetSchemaConverter.scala:65)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:62)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readSchemaFromFooter$2.apply(ParquetFileFormat.scala:664)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readSchemaFromFooter$2.apply(ParquetFileFormat.scala:664)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:664)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$9.apply(ParquetFileFormat.scala:621)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$9.apply(ParquetFileFormat.scala:603)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more

我尝试解决这个问题,发现 spark parquet 不支持某些类型。
那么有没有办法加载我的表?制作新表是唯一的方法吗?因为这个问题,我花了很长时间......

【问题讨论】:

    标签: apache-spark pyspark parquet


    【解决方案1】:

    Spark parquet 不支持 uint 等某些类型。我的表有 uint 类型,所以这就是问题所在。
    我用这个答案解决了这个问题 https://stackoverflow.com/a/62654180/8578220
    首先,制作新的架构:

    from pyspark.sql.types import *        
    newSchema = StructType([ StructField("ID", LongType(), True),
                             StructField("point", IntegerType(), True),
                             StructField("check", IntegerType(), True) ])
    

    并使用此架构打开镶木地板文件

    df = hc.read.option("mergeSchema", "true").schema(newSchema).parquet(path)
    

    它对我有用。

    【讨论】:

    • 嘿@fresh hcdf = hc.read.option("mergeSchema", "true").schema(newSchema).parquet(path) 中代表什么?
    • @SandeepSingh hc 是配置单元上下文。我是这样设计的:sc = SparkContext(conf=sparkConf); hc = HiveContext(self.sc)
    猜你喜欢
    • 2022-06-16
    • 2018-08-13
    • 1970-01-01
    • 2021-11-10
    • 2021-10-14
    • 2017-03-11
    • 2021-03-15
    • 2019-08-04
    • 2015-10-27
    相关资源
    最近更新 更多