AWS Athena DDL from parquet file with structs as columns

【Question title】: AWS Athena DDL from parquet file with structs as columns
【Posted】: 2021-10-18 03:41:00
【Question】:

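I used a Glue crawler to generate an Athena DDL statement that creates an AWS Athena table from Parquet files stored in S3. However, when I copy the DDL and use it in a different AWS account, I get the following error: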

line 7:25: input "

Athena DDL

CREATE  TABLE x.y(
  "questiontext" string, 
  "dataexporttag" string, 
  "questiontype" string, 
  "selector" string, 
  "subselector" string, 
  "configuration" struct<ChoiceColumnWidth:bigint,MobileFirst:boolean,QuestionDescriptionOption:string,RepeatHeaders:string,TextPosition:string,WhiteSpace:string>, 
  "questiondescription" string, 
  "choices" struct<1:struct<Display:string>,2:struct<Display:string>,3:struct<Display:string>,4:struct<Display:string>,5:struct<Display:string>,6:struct<Display:string>,7:struct<Display:string>,8:struct<Display:string,ExclusiveAnswer:boolean>,9:struct<Display:string>>, 
  "choiceorder" array<bigint>, 
  "validation" struct<Settings:struct<ForceResponse:string,ForceResponseType:string,Type:string>>, 
  "language" array<int>, 
  "nextchoiceid" bigint, 
  "nextanswerid" bigint, 
  "questionid" string, 
  "questiontext_unsafe" string, 
  "variablenaming" struct<1:string,2:string,3:string,4:string,5:string,6:string,7:string,8:string,9:string>, 
  "datavisibility" struct<Hidden:boolean,Private:boolean>, 
  "recodevalues" struct<1:string,2:string,3:string,4:string,5:string,6:string,7:string,8:string,9:string>, 
  "randomization" struct<Advanced:struct<FixedOrder:array<string>,RandomSubSet:array<int>,RandomizeAll:array<string>,TotalRandSubset:bigint,Undisplayed:array<int>>,EvenPresentation:boolean,TotalRandSubset:string,Type:string>, 
  "defaultchoices" boolean, 
  "gradingdata" array<int>, 
  "searchsource" struct<AllowFreeResponse:string>, 
  "displaylogic" struct<0:struct<0:struct<ChoiceLocator:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,QuestionID:string,QuestionIDFromLocator:string,QuestionIsInLoop:string,RightOperand:string,Type:string>,1:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,2:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,3:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,4:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,5:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,6:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,Type:string>,Type:string,inPage:boolean>, 
  "analyzechoices" struct<6:string,8:string>, 
  "answers" struct<1:struct<Display:string>,2:struct<Display:string>,3:struct<Display:string>,4:struct<Display:string>,5:struct<Display:string>,6:struct<Display:string>>, 
  "answerorder" array<bigint>, 
  "choicedataexporttags" boolean)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'

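I am able to query the table that the crawler generated, and the schema appears to be correct. Can anyone help me understand why I can't use the DDL for the same files in a different AWS account?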

【Question comments】:

Tags: amazon-s3 parquet amazon-athena presto

【Solution 1】:

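There are a number of problems with the DDL statement. How was it generated? I recommend using SHOW CREATE TABLE x to generate DDL statements that will work with Athena.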

These are some of the problems:

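The first line is missing EXTERNAL between CREATE and TABLE.
Column names cannot be quoted with double quotes in Athena DDL. This is a bit odd, because that is how you quote them in DML, but DDL is parsed by Hive whereas DML is parsed by Presto, and the two have different syntaxes ¯\_(ツ)_/¯. If you need to quote column names in DDL, the correct character is the backtick.
Struct fields cannot start with digits. Do the structs really have fields with numeric names? Are they in fact arrays?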
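
To make the three points above concrete, here is a rough Python sketch that applies the two mechanical fixes (adding EXTERNAL, swapping double quotes for backticks) and lists the struct fields whose names start with a digit. The function names `fix_crawler_ddl` and `numeric_struct_fields` are made up for illustration, and the regexes assume the DDL looks like the crawler output in the question (no double quotes inside comments or string literals). The numeric field names are only reported, since fixing them means deciding what the data really is.

```python
import re
from typing import List


def fix_crawler_ddl(ddl: str) -> str:
    """Apply the two mechanical fixes: add EXTERNAL, use backticks for quoting."""
    # Insert EXTERNAL between CREATE and TABLE unless it is already there.
    fixed = re.sub(
        r"^\s*CREATE\s+TABLE\b",
        "CREATE EXTERNAL TABLE",
        ddl,
        count=1,
        flags=re.IGNORECASE,
    )
    # Hive-style DDL quotes identifiers with backticks, not double quotes.
    return re.sub(r'"([^"]+)"', r"`\1`", fixed)


def numeric_struct_fields(ddl: str) -> List[str]:
    """Return struct field names that begin with a digit (sorted as strings)."""
    # A struct field name directly follows '<' or ',' and is followed by ':'.
    return sorted(set(re.findall(r"[<,](\d\w*):", ddl)))


if __name__ == "__main__":
    sample = (
        'CREATE  TABLE x.y(\n'
        '  "questiontext" string, \n'
        '  "analyzechoices" struct<6:string,8:string>)'
    )
    print(fix_crawler_ddl(sample))
    print(numeric_struct_fields(sample))
```

If the second function returns anything, the statement will still need manual work after the first function has run.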

You may also run into problems with the casing of field names. Athena is case insensitive, and that can trip things up in struct fields, but YMMV.
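
If you want to see which struct fields could be affected by the casing issue, something like the following Python sketch lists every struct field name that contains an uppercase letter. The helper name `mixed_case_struct_fields` is hypothetical, and the regex assumes the `name:type` layout used in the DDL from the question. It only points out names to review; whether a given name actually causes trouble depends on the data.

```python
import re
from typing import List


def mixed_case_struct_fields(ddl: str) -> List[str]:
    """Return struct field names that contain at least one uppercase letter."""
    # A struct field name directly follows '<' or ',' and is followed by ':'.
    names = re.findall(r"[<,]([A-Za-z_]\w*):", ddl)
    return sorted({name for name in names if name != name.lower()})


if __name__ == "__main__":
    column = '"datavisibility" struct<Hidden:boolean,Private:boolean>'
    print(mixed_case_struct_fields(column))
```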

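Glue crawlers are notoriously bad at generating correct schemas when things are not standard and basic. I recommend setting up the table by hand and using partition projection.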

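You may wonder how Glue managed to create a table when the DDL can't be used to create another one. The reason is that Glue crawlers use the Glue API. They don't generate a DDL statement and run it through Athena. The Glue APIs don't impose the same rules, because they are designed to support multiple services besides Athena, such as Spark and Hadoop on EMR, and Redshift Spectrum. Unfortunately, just because there is a table in the Glue Data Catalog doesn't mean it will work with Athena.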

【Comments】: