【Question title】: Store multiple elements in json files in AWS Athena
【Posted】: 2017-06-21 09:44:47
【Question description】:

I have some json files stored in an S3 bucket, where each file contains multiple elements with the same structure. For example,

[{"eventId":"1","eventName":"INSERT","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"New item!","Id":101}},{"eventId":"2","eventName":"MODIFY","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}},{"eventId":"3","eventName":"REMOVE","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}]

I want to create a table in Athena corresponding to the above data.

The query I wrote to create the table:

CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.elb_logs2 (
  `eventId` string,
  `eventName` string,
  `eventVersion` string,
  `eventSource` string,
  `awsRegion` string,
  `image` map<string,string> 
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'field.delim' = ' '
) LOCATION 's3://<bucketname>/';

But if I run a SELECT query as follows,

SELECT * FROM sampledb.elb_logs2;

I get the following result:

1   {"eventid":"1","eventversion":"1.0","image":{"id":"101","message":"New item!"},"eventsource":"aws:dynamodb","eventname":"INSERT","awsregion":"us-west-2"}   {"eventid":"2","eventversion":"1.0","image":{"id":"101","message":"This item has changed"},"eventsource":"aws:dynamodb","eventname":"MODIFY","awsregion":"us-west-2"}   {"eventid":"3","eventversion":"1.0","image":{"id":"101","message":"This item has changed"},"eventsource":"aws:dynamodb","eventname":"REMOVE","awsregion":"us-west-2"}   

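The entire content of the json file is selected as a single entry here.

How can I read each element of the json file as a separate entry?

Edit: How can I read each sub-column of image, i.e. each element of the map?

Thanks.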

【Comments】:

    Tags: sql json amazon-web-services amazon-athena


    【Solution 1】:

    Question 1: Store multiple elements in json files in AWS Athena

    I needed to rewrite my json file as

    {"eventId":"1","eventName":"INSERT","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"New item!","Id":101}}
    {"eventId":"2","eventName":"MODIFY","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}
    {"eventId":"3","eventName":"REMOVE","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}

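    That is:

    Remove the square brackets [ ] and keep each element on its own line, because the JSON SerDe treats every line of a file as one record: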

    {.....................}
    {.....................}
    {.....................}
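The rewrite described above can be scripted. The following is a minimal sketch, assuming the original file is a single top-level JSON array that fits in memory; the function name `array_to_ndjson` is hypothetical and not part of any AWS tooling, and reading from or writing back to S3 is left out.

```python
import json


def array_to_ndjson(array_text: str) -> str:
    """Convert a JSON array of objects into newline-delimited JSON.

    Hypothetical helper for illustration: one output line per array element,
    with no enclosing brackets and no commas between records.
    """
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # Compact separators keep each record on a single line; json.dumps
    # escapes any newline characters that occur inside string values.
    return "".join(
        json.dumps(record, separators=(",", ":"), ensure_ascii=False) + "\n"
        for record in records
    )


if __name__ == "__main__":
    sample = (
        '[{"eventId":"1","image":{"Message":"New item!","Id":101}},'
        '{"eventId":"2","image":{"Message":"This item has changed","Id":101}}]'
    )
    print(array_to_ndjson(sample), end="")
```

Each converted file would then replace the original array-style file under the table's S3 location, so that Athena sees one JSON object per line.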
    

    Question 2: Access nested (non-flat) json attributes

    CREATE EXTERNAL TABLE IF NOT EXISTS <tablename> (
      `eventId` string,
      `eventName` string,
      `eventVersion` string,
      `eventSource` string,
      `awsRegion` string,
      `image` struct <`Id` : string,
                      `Message` : string>
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    WITH SERDEPROPERTIES (
      'serialization.format' = '1',
      'dots.in.keys' = 'true'
    ) LOCATION 's3://exampletablewithstream-us-west-2/';
    

    Query:

    select image.Id, image.message from <tablename>;
    

    References:

    http://engineering.skybettingandgaming.com/2015/01/20/parsing-json-in-hive/

    https://github.com/rcongiu/Hive-JSON-Serde#mapping-hive-keywords

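    【Discussion】:

    • I have the same problem as Q1. My data comes from sendgrid, so I don't have much choice about the data format :(
    • Were you able to tell kinesis firehose to put a newline after each entry in an S3 file?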