Can we have different schema per row-group in the same parquet file?

【Question title】: Can we have different schema per row-group in the same parquet file?
【Posted】: 2019-12-25 03:19:35
【Question】:

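When creating a Parquet file, can each row group have a different schema? In that case the footer would hold the union of the schemas of all row groups, but the schema of each individual row group would differ. Is this a valid Parquet file? Does the Parquet specification clearly state that the schema cannot change from one row group to the next within the same file?

The official specification is not very specific on this point, but Spark cannot read a file written this way.

I tried writing such a file and reading it with spark.read.parquet, but got the following error: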

// this line works fine and it shows the schema from the footer where we have a unioned schema of all the rowgroups.
var df = spark.read.option("mergeSchema", "true").parquet("abc.parquet") 

// but when I try to do df.show() it throws an error
df.show()

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 86, 10.139.64.6, executor 0): java.lang.IllegalArgumentException: [Visibility_value_string] optional binary Visibility_value_string (UTF8) is not in the store: .....

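The specification here only says that the columns should be in the same order as in FileMetadata, which I interpret to mean that I can have more columns in subsequent row groups.

In other words, the specification only says that the schema in each row group must list its columns in the same order as FileMetadata; it does not actually say that each row group must contain all of the columns. In that case, can we have more columns in subsequent row groups?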

row group 1 -> col1, col2
row group 2 -> col1, col2, col3
row group 3 -> col1, col2, col3, col4
file metadata -> col1, col2, col3, col4

Is this an acceptable Parquet file? If not, why not?
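
One way to check what a writer actually produced is to compare the column chunks recorded for each row group against the footer schema. The following is only a minimal sketch, assuming the parquet-hadoop (parquet-mr) and Hadoop client libraries are on the classpath, as they are in a typical Spark installation. The object name RowGroupColumns and the method inspect are hypothetical names chosen for illustration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

// Hypothetical helper, not part of Spark or Parquet.
object RowGroupColumns {
  // Returns (column paths in the footer schema, column paths per row group)
  // for a single Parquet file.
  def inspect(file: String): (Seq[String], Seq[Seq[String]]) = {
    val input = HadoopInputFile.fromPath(new Path(file), new Configuration())
    val reader = ParquetFileReader.open(input)
    try {
      val footer = reader.getFooter
      // Leaf columns of the file-level schema stored in the footer.
      val footerCols = footer.getFileMetaData.getSchema.getColumns.asScala
        .map(_.getPath.mkString(".")).toSeq
      // Column chunks actually recorded for each row group.
      val perGroup = footer.getBlocks.asScala
        .map(_.getColumns.asScala.map(_.getPath.toDotString).toSeq).toSeq
      (footerCols, perGroup)
    } finally reader.close()
  }
}
```

If a row group lists fewer columns than the footer schema, that mismatch is the layout described in this question.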

【Question comments】:

Tags: apache-spark parquet databricks azure-databricks

【Answer 1】:

A single file needs to be internally consistent, but when you have multiple files, they can have "compatible" but different schemas.
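
The multi-file approach can be sketched roughly as follows. This is an illustration rather than a definitive recipe: it assumes a local Spark installation, and the object name MergeSchemaSketch, the method mergedColumns, and the directory names part1 and part2 are hypothetical. Each write produces its own internally consistent Parquet output, and the mergeSchema read option (the same one used in the question) combines the schemas at read time.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper for illustration only.
object MergeSchemaSketch {
  // Writes two Parquet outputs with compatible schemas under baseDir,
  // reads them back together, and returns the merged column names.
  def mergedColumns(baseDir: String): Seq[String] = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("merge-schema-sketch")
      .getOrCreate()
    import spark.implicits._

    // First output: col1, col2 only.
    Seq((1, "a")).toDF("col1", "col2")
      .write.mode("overwrite").parquet(s"$baseDir/part1")
    // Second output: the same columns plus col3.
    Seq((2, "b", 3.0)).toDF("col1", "col2", "col3")
      .write.mode("overwrite").parquet(s"$baseDir/part2")

    // mergeSchema asks Spark to union the schemas of all files it reads.
    val df = spark.read.option("mergeSchema", "true")
      .parquet(s"$baseDir/part1", s"$baseDir/part2")
    df.show() // rows from part1 should show null for col3
    df.columns.toSeq
  }
}
```

Calling MergeSchemaSketch.mergedColumns with a scratch directory should return col1, col2 and col3, with col3 null for the rows that came from the narrower file.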

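【Comments】:

Could you explain what you mean by internally consistent? The specification here only says that the columns should be in the same order as in FileMetadata, which I interpret to mean that I can have more columns in subsequent row groups. Is that right?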