Exclude files based on name when calling from_catalog: answers

[Question title]: Exclude files based on name when calling from_catalog
[Posted]: 2022-12-19 16:00:36
[Question description]:

I am reading data via

glueContext.create_data_frame.from_catalog(database = "db", table_name = "ta")

from Parquet files in an S3 bucket. Unfortunately, the bucket seems to contain a non-Parquet file (last_ingest_partition), which causes the following error: An error occurred while calling o92.getDataFrame. s3://cdh/measurements/ta/last_ingest_partition is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [49, 45, 49, 50]
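For readers unfamiliar with the error above: the two byte lists are the expected and actual last four bytes of the file. A Parquet file ends with the ASCII marker PAR1, and the reader found plain text instead. The following is a minimal local sketch, not part of the original post; the helper name has_parquet_magic is hypothetical, and it only inspects the tail bytes rather than validating a full Parquet file.

```python
import os
import tempfile

PARQUET_MAGIC = b"PAR1"  # 4-byte marker at the end of every Parquet file

# Decode the two byte lists quoted in the error message.
expected = bytes([80, 65, 82, 49])
found = bytes([49, 45, 49, 50])
print(expected)  # b'PAR1'
print(found)     # b'1-12'


def has_parquet_magic(path):
    """Return True if the last four bytes of the file are b'PAR1'.

    Hypothetical helper: it checks only the tail marker, so a True
    result does not prove the file is a valid Parquet file.
    """
    if os.path.getsize(path) < len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as f:
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        return f.read() == PARQUET_MAGIC


# Simulate a text marker file whose tail matches the bytes from the error.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(found)
marker_path = tmp.name

print(has_parquet_magic(marker_path))  # False
os.remove(marker_path)
```

So the file last_ingest_partition appears to be a small text file stored next to the data files, which is why the Parquet reader rejects it.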

Is it possible to exclude that file from being read? I tried something like

glueContext.create_data_frame.from_catalog(database = "db", table_name = "ta", additional_options={"exclusions" : "[\"**last_ingest_partition\""})

but this did not work for me.

[Question discussion]:

Tags: pyspark aws-glue

[Solution 1]:

Here is what I found and how I solved my problem:

When I switched my code to create_dynamic_frame.from_catalog instead of create_data_frame.from_catalog and added .toDF() afterwards, everything worked fine for me.
For create_dynamic_frame, I can also pass exclusions as an additional option: glueContext.create_dynamic_frame.from_catalog(database = "testdb1", table_name = "cxexclude", additional_options={"exclusions": "[\"**{json,parquet}**\"]"})
For create_data_frame, there is a limitation: Spark DataFrame partition filtering does not work.
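The workaround described in this answer could be wrapped up roughly as follows. This is a sketch under assumptions, not the answerer's exact code: the helper names build_exclusions and read_table_excluding are hypothetical, the glob "**last_ingest_partition" is taken from the question and has not been verified against the real bucket, and glue_context is assumed to be an existing GlueContext inside a Glue job. Building the option with json.dumps avoids hand-escaping the quotes inside the JSON list string.

```python
import json


def build_exclusions(patterns):
    """Serialize glob patterns into a JSON-list string for the "exclusions" option."""
    return json.dumps(list(patterns))


def read_table_excluding(glue_context, database, table_name, patterns):
    """Read a catalog table as a DynamicFrame, skip files matching the
    glob patterns, and convert the result to a Spark DataFrame.

    glue_context is assumed to be an awsglue GlueContext created by the job.
    """
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        additional_options={"exclusions": build_exclusions(patterns)},
    )
    return dyf.toDF()


# The option value is a string that contains a JSON list.
print(build_exclusions(["**last_ingest_partition"]))  # ["**last_ingest_partition"]

# Inside a Glue job (not runnable locally), usage might look like:
# df = read_table_excluding(glueContext, "db", "ta", ["**last_ingest_partition"])
```

Taking the patterns as a Python list keeps the call site readable and makes it easy to exclude several file names at once.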

[Discussion]: