【Question title】: How to extract data from specific fields in a NESTED JSON using AWS Athena - Presto?
【Posted】: 2019-07-24 14:52:51
【Question】:

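I have JSON in the following format in an S3 bucket, and I am trying to use Athena to extract only "id", "label" and "value" from the "fields" key. I tried ARRAY-MAP but wasn't successful. Also, for the "value" field, I want the content captured as simple text, ignoring any lists/dictionaries inside it.

I would also prefer not to create any Hive schema for these JSON documents, and I am looking for a Presto SQL solution if possible.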

{  
    "reports":{  
        "client":{  
            "pdf":"https://reports.s3-accelerate.amazonaws.com/looks/123/reports/client.pdf",
            "html":"https://api.com/looks/123/reports/client.html"
        },
        "public":{  
            "pdf":"https://s3.amazonaws.com/reports.com/looks/123/reports/public.pdf",
            "html":"https://api.look.com/looks/123/reports/public.html"
        }
    },
    "actors":{  
        "looker":{  
            "firstName":"Rosa",
            "lastName":"Mart"
        },
        "client":{  
            "email":"XXX.XXX@XXXXXX.com",
            "firstName":"XXX",
            "lastName":"XXX"
        }
    },
    "_id":"123",
    "fields":[  
        {  
            "id":"fence_condition_missing_sections",
            "context":[  
                "Fence Condition"
            ],
            "label":"Missing Sections",
            "type":"choice",
            "value":"None"
        },
        {  
            "id":"photos_landscaped_area",
            "context":[  
                "Landscaping Photos"
            ],
            "label":"Landscaped Area",
            "type":"photo-with-description",
            "value":[  
                {  
                    "description":"Front",
                    "photo":"https://reports-wegolook-com.s3-accelerate.amazonaws.com/looks/123/looker/1.jpg"
                },
                {  
                    "description":"Front entrance ",
                    "photo":"https://reports-wegolook-com.s3-accelerate.amazonaws.com/looks/123/looker/2.jpg"
                }
            ]
        }
    ],
    "jobNumber":"xxx",
    "createdAt":"2018-10-11T22:39:37.223Z",
    "completedAt":"2018-01-27T20:13:49.937Z",
    "inspectedAt":"2018-01-21T23:33:48.718Z",
    "type":"ZZZ-commercial",
    "name":"Commercial"
}

Expected output:

---------------------------------------------------------------------------------------
| ID                               | LABEL            | VALUE                         |
---------------------------------------------------------------------------------------
| photos_landscaped_area           | Landscaped Area  | [{"description":"Front",...}] |
---------------------------------------------------------------------------------------
| fence_condition_missing_sections | Missing Sections | None                          |
---------------------------------------------------------------------------------------

【Comments】:

    Tags: arrays json amazon-athena presto


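    【Solution 1】:

    I'm going to assume that your data is stored with one document per line, and that the example you posted was formatted only for readability. If that is not the case, have a look at the question Multi-line JSON file querying in hive.

    When the schema of the JSON documents is not entirely regular, you can declare the irregular column as a string column and use the JSON_* functions to extract values from it.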
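
    As a standalone illustration of the JSON_* functions on a string value, here is a minimal sketch that needs no table. The inline document and the column name doc are made up for the example (a trimmed-down version of the JSON in the question); JSON_EXTRACT_SCALAR returns a scalar as plain text, while JSON_EXTRACT returns JSON that JSON_FORMAT turns back into text.

    ```sql
    -- Hypothetical inline document standing in for one line of the S3 data.
    SELECT
      JSON_EXTRACT_SCALAR(doc, '$.name') AS name,                      -- scalar as plain text
      JSON_EXTRACT_SCALAR(doc, '$.fields[0].id') AS first_field_id,    -- index into the array
      JSON_FORMAT(JSON_EXTRACT(doc, '$.fields[0].context')) AS first_field_context  -- non-scalar kept as JSON text
    FROM (
      VALUES '{"name":"Commercial","fields":[{"id":"fence_condition_missing_sections","context":["Fence Condition"],"value":"None"}]}'
    ) AS t(doc)
    ```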

    First you need to create a table for the raw data:

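    -- Athena requires EXTERNAL for tables defined over existing S3 data
    CREATE EXTERNAL TABLE data (
      -- value is declared as string because its shape varies between fields
      fields array<struct<id:string,label:string,value:string>>
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://…'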
    

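    (If you are not interested in the other fields of the JSON documents, you can simply leave them out when creating the table.)

    Then create a view that flattens the data: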

    CREATE VIEW flat_data AS
    SELECT
      field.id,
      field.label,
      field.value
    FROM data
    CROSS JOIN UNNEST(fields) AS f(field)
    

    Selecting from this view should give you the result you are after.
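
    To try the flattening step without any S3 data, the following sketch replaces the data table with an inline CTE. It is an assumption-laden variant, not the view above: the sample rows are abbreviated from the question, and fields is parsed from a JSON string into ARRAY(MAP(VARCHAR, JSON)) so that value can be either a scalar or a list. COALESCE picks the plain text when value is a scalar and falls back to the serialized JSON otherwise, which is one way to get value as simple text.

    ```sql
    -- Inline stand-in for the data table; sample values abbreviated from the question.
    WITH data AS (
      SELECT CAST(
        JSON_PARSE('[{"id":"fence_condition_missing_sections","label":"Missing Sections","value":"None"},{"id":"photos_landscaped_area","label":"Landscaped Area","value":[{"description":"Front"}]}]')
        AS ARRAY(MAP(VARCHAR, JSON))
      ) AS fields
    )
    SELECT
      JSON_EXTRACT_SCALAR(field['id'], '$') AS id,
      JSON_EXTRACT_SCALAR(field['label'], '$') AS label,
      -- JSON_EXTRACT_SCALAR yields NULL for arrays/objects, so fall back to the JSON text
      COALESCE(
        JSON_EXTRACT_SCALAR(field['value'], '$'),
        JSON_FORMAT(field['value'])
      ) AS value
    FROM data
    CROSS JOIN UNNEST(fields) AS f(field)
    ```

    Under these assumptions the query should return one row per entry in fields, with value as plain text for the first entry and as serialized JSON for the second.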

    I suspect you are also looking for how to extract properties from the value structure, which is what I alluded to above:

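    SELECT
      label,
      -- value holds a JSON array, so parse it and take photo from each element
      TRANSFORM(
        CAST(JSON_PARSE(value) AS ARRAY(MAP(VARCHAR, VARCHAR))),
        item -> item['photo']
      ) AS photo_urls
    FROM flat_data
    WHERE id = 'photos_landscaped_area'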
    

    See the Presto documentation for all the available JSON functions.
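
    If one row per photo is more convenient than a single JSON value, a second UNNEST over the parsed array is one option. This is a sketch under assumptions: the flat_data CTE below is an inline stand-in for the view with a single hard-coded row, and it assumes value arrives as a JSON array string whose elements contain only string properties.

    ```sql
    -- Inline stand-in for the flat_data view; the row is copied from the question.
    WITH flat_data AS (
      SELECT
        'photos_landscaped_area' AS id,
        'Landscaped Area' AS label,
        '[{"description":"Front","photo":"https://reports-wegolook-com.s3-accelerate.amazonaws.com/looks/123/looker/1.jpg"},{"description":"Front entrance ","photo":"https://reports-wegolook-com.s3-accelerate.amazonaws.com/looks/123/looker/2.jpg"}]' AS value
    )
    SELECT
      label,
      item['description'] AS description,
      item['photo'] AS photo_url
    FROM flat_data
    -- Parse the JSON text into an array of maps, then emit one row per element
    CROSS JOIN UNNEST(
      CAST(JSON_PARSE(value) AS ARRAY(MAP(VARCHAR, VARCHAR)))
    ) AS v(item)
    WHERE id = 'photos_landscaped_area'
    ```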

    【Comments】:
