【Question title】: How to get First_Value(), Last_Value() and the previous Date action for an Array object inside a VARIANT column in Snowflake SQL
【Posted】: 2020-10-18 06:39:07
【Question】:

I have a VARIANT column called 'REQUEST' in the table 'QWERTY'. It holds JSON like the following, including an array of objects:

{
"ID": "123123",
"workflowHistory": [
                   {
                    "id": "666",
                    "workflowType": "CCC",
                    "entityId": "123123",
                    "creator": {
                        "id": "503081",
                        "displayName": "AGENT2",
                        "email": "AGENT2@SOMETHING.com",
                        "userAvatarUrl": "XXXXXXX"
                    },
                    "createdDate": "2020-04-30T21:58:09Z",
                    "deletor": null,
                    "deletedDate": null,
                    "clientId": "000000000",
                    "value": "00000000"
                },
                {
                    "id": "555",
                    "workflowType": "AAA",
                    "entityId": "123123",
                    "creator": {
                        "id": "503080",
                        "displayName": "AGENT1",
                        "email": "AGENT1@SOMETHING.com",
                        "userAvatarUrl": "XXXXXXX"
                    },
                    "createdDate": "2020-04-30T21:55:09Z",
                    "deletor": null,
                    "deletedDate": null,
                    "clientId": "000000000",
                    "value": "00000000"
                },
                {
                   "id": "444",
                    "workflowType": "xyz",
                    "entityId": "123123",
                    "creator": {
                        "id": "503080",
                        "displayName": "AGENT1",
                        "email": "AGENT1@SOMETHING.com",
                        "userAvatarUrl": "XXXXXXX"
                    },
                    "createdDate": "2020-04-30T21:19:09Z",
                    "deletor": null,
                    "deletedDate": null,
                    "clientId": "000000000",
                    "value": "00000000"
                },
                {
                   "id": "333",
                    "workflowType": "BBB",
                    "entityId": "123123",
                    "creator": {
                        "id": "503079",
                        "displayName": "AGENT0",
                        "email": "AGENT0@SOMETHING.com",
                        "userAvatarUrl": "XXXXXXX"
                    },
                    "createdDate": "2020-04-30T21:10:09Z",
                    "deletor": null,
                    "deletedDate": null,
                    "clientId": "000000000",
                    "value": "00000000"
                },
                {
                   "id": "222",
                    "workflowType": "ZZZ",
                    "entityId": "123123",
                    "creator": {
                        "id": "503079",
                        "displayName": "AGENT0",
                        "email": "AGENT0@SOMETHING.com",
                        "userAvatarUrl": "XXXXXXX"
                    },
                    "createdDate": "2020-04-30T21:08:09Z",
                    "deletor": null,
                    "deletedDate": null,
                    "clientId": "000000000",
                    "value": "00000000"
                }
                    ]
}

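In addition, the 'QWERTY' table has HARVEST_DATE and a PK ARTICLE_ID (the same as REQUEST:workflowHistory.ID), and I am trying to get an output with the following columns:

  1. ID
  2. Last createdDate for AGENTn
  3. First createdDate for AGENTn
  4. Previous createdDate, made by AGENTn-1
  5. Next createdDate, made by AGENTn+1

I want output like this: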

OUTPUT

To do that, I am building a query like this:

WITH WorkFlow_Parsed AS(

SELECT ARTICLE_ID,
       HARVEST_DATE,
       value:createdDate::timestamp_tz  AS create_date,                
       value:creator:email AS email,
       value:workflowType AS  workflowType,
       value:value AS value
      
FROM 'QWERTY', lateral flatten( input => REQUEST:workflowHistory )
),


lag_Agent_timing AS 
(SELECT
WorkFlow_Parsed.ARTICLE_ID AS ARTICLE_ID,WorkFlow_Parsed.email,LAG(WorkFlow_Parsed.create_date) IGNORE NULLS over (partition by  WorkFlow_Parsed.email,WorkFlow_Parsed.ARTICLE_ID order by WorkFlow_Parsed.create_date) AS lag_date_value
FROM  WorkFlow_Parsed),

lead_agent_timing AS
(SELECT
WorkFlow_Parsed.ARTICLE_ID AS ARTICLE_ID,WorkFlow_Parsed.email,LEAD(WorkFlow_Parsed.create_date) IGNORE NULLS over (partition by WorkFlow_Parsed.email,WorkFlow_Parsed.ARTICLE_ID order by WorkFlow_Parsed.create_date)  AS lead_date_value
FROM  WorkFlow_Parsed)


SELECT 
DISTINCT 
WorkFlow_Parsed.ARTICLE_ID AS _ARTICLE_ID,
WorkFlow_Parsed.email AS _email,
last_value(WorkFlow_Parsed.create_date) over (partition by WorkFlow_Parsed.email,WorkFlow_Parsed.ARTICLE_ID order by WorkFlow_Parsed.create_date) AS last_date_value,
first_value(WorkFlow_Parsed.create_date) over (partition by WorkFlow_Parsed.email,WorkFlow_Parsed.ARTICLE_ID order by WorkFlow_Parsed.create_date) AS first_date_value,
MAX(lag_Agent_timing.lag_date_value),
MIN(lead_agent_timing.lead_date_value)
FROM  WorkFlow_Parsed
JOIN lag_Agent_timing ON WorkFlow_Parsed.ARTICLE_ID=lag_Agent_timing.ARTICLE_ID AND lag_Agent_timing.email=WorkFlow_Parsed.email
JOIN lead_agent_timing ON WorkFlow_Parsed.ARTICLE_ID=lead_agent_timing.ARTICLE_ID AND lead_agent_timing.email=WorkFlow_Parsed.email  
GROUP BY _ARTICLE_ID,_email

But I get the error message: "[SYS_VW.CREATE_DATE_1] is not a valid group by expression"

How can I solve this?

【Comments】:

    Tags: sql json snowflake-cloud-data-platform lag flatten


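    【Answer 1】:

    [SYS_VW.CREATE_DATE_1] is not a valid group by expression

    The error comes from your use of GROUP BY in the final SELECT query. It is telling you that the query references Workflow_Parsed.create_date as a non-grouped column, yet that column is not part of the GROUP BY _ARTICLE_ID, _email expression. In other words, it is the same as the "[Workflow_Parsed.create_date] is not a valid group by expression" message you would get if you simplified the query a little.

    Snowflake does not allow aggregating over a window function expression. If you want to mix a GROUP BY with a window function, try nesting the query in a structure such as SELECT cols, aggregate(cols) FROM (SELECT cols, window(cols)) GROUP BY cols to keep the two apart (i.e. first apply the window functions to all rows, then aggregate the whole result they produce).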
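The nested shape described above can be tried out without a Snowflake account. The sketch below is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for Snowflake, and a hand-made workflow_parsed table (a hypothetical name) holding the already-flattened rows from the question, because FLATTEN and VARIANT do not exist in SQLite. The inner query computes only window functions; the outer query applies only aggregates to the columns the inner query produced.

```python
import sqlite3

# Toy rows standing in for the flattened workflowHistory entries
# (article_id, email, create_date). SQLite is an assumption made so the
# sketch can run locally; it is not Snowflake.
rows = [
    ("123123", "AGENT2@SOMETHING.com", "2020-04-30T21:58:09Z"),
    ("123123", "AGENT1@SOMETHING.com", "2020-04-30T21:55:09Z"),
    ("123123", "AGENT1@SOMETHING.com", "2020-04-30T21:19:09Z"),
    ("123123", "AGENT0@SOMETHING.com", "2020-04-30T21:10:09Z"),
    ("123123", "AGENT0@SOMETHING.com", "2020-04-30T21:08:09Z"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workflow_parsed (article_id TEXT, email TEXT, create_date TEXT)"
)
conn.executemany("INSERT INTO workflow_parsed VALUES (?, ?, ?)", rows)

query = """
SELECT article_id, email,
       MIN(create_date) AS first_date_value,
       MAX(create_date) AS last_date_value,
       MIN(lag_value)   AS min_lag,
       MAX(lead_value)  AS max_lead
FROM (
    -- inner query: window functions only, one output row per input row
    SELECT article_id, email, create_date,
           LAG(create_date)  OVER (PARTITION BY article_id ORDER BY create_date) AS lag_value,
           LEAD(create_date) OVER (PARTITION BY article_id ORDER BY create_date) AS lead_value
    FROM workflow_parsed
) AS windowed
-- outer query: plain aggregates over the already-computed window columns
GROUP BY article_id, email
ORDER BY email
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Because LAG and LEAD run over every row of the article here, an agent's minimum lag can be one of that agent's own earlier rows, as happens for AGENT0 with this data.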

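    I'm not sure what the window functions in your example query are trying to do, since I can't see an agent n ± 1 relationship anywhere in it. Going by the requirements you describe and the sample output you included, though, the following should work (it uses only scalar subqueries, no window functions):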

    WITH workflows AS (
      SELECT PARSE_JSON('{"ID":"123123","workflowHistory":[{"id":"666","workflowType":"CCC","entityId":"123123","creator":{"id":"503081","displayName":"AGENT2","email":"AGENT2@SOMETHING.com","userAvatarUrl":"XXXXXXX"},"createdDate":"2020-04-30T21:58:09Z","deletor":null,"deletedDate":null,"clientId":"000000000","value":"00000000"},{"id":"555","workflowType":"AAA","entityId":"123123","creator":{"id":"503080","displayName":"AGENT1","email":"AGENT1@SOMETHING.com","userAvatarUrl":"XXXXXXX"},"createdDate":"2020-04-30T21:55:09Z","deletor":null,"deletedDate":null,"clientId":"000000000","value":"00000000"},{"id":"444","workflowType":"xyz","entityId":"123123","creator":{"id":"503080","displayName":"AGENT1","email":"AGENT1@SOMETHING.com","userAvatarUrl":"XXXXXXX"},"createdDate":"2020-04-30T21:19:09Z","deletor":null,"deletedDate":null,"clientId":"000000000","value":"00000000"},{"id":"333","workflowType":"BBB","entityId":"123123","creator":{"id":"503079","displayName":"AGENT0","email":"AGENT0@SOMETHING.com","userAvatarUrl":"XXXXXXX"},"createdDate":"2020-04-30T21:10:09Z","deletor":null,"deletedDate":null,"clientId":"000000000","value":"00000000"},{"id":"222","workflowType":"ZZZ","entityId":"123123","creator":{"id":"503079","displayName":"AGENT0","email":"AGENT0@SOMETHING.com","userAvatarUrl":"XXXXXXX"},"createdDate":"2020-04-30T21:08:09Z","deletor":null,"deletedDate":null,"clientId":"000000000","value":"00000000"}]}') AS request
    ), workflow_rows AS (
      SELECT
        w.request:ID::varchar AS article_id,        
        lf.value:createdDate::timestamp_tz  AS created_date,
        lf.value:creator.id::integer AS creator_id,
        lf.value:creator.email::varchar AS creator_email,
        lf.value:workflowType::varchar AS workflow_type,
        lf.value:value::varchar AS workflow_value
      FROM workflows w, LATERAL FLATTEN(REQUEST:workflowHistory) lf
    ), article_workflow_creators AS (
      SELECT DISTINCT
        article_id,
        creator_id,
        creator_email
      FROM workflow_rows
    )
    SELECT
        awc.article_id,
        awc.creator_id,
        awc.creator_email,
        (SELECT MAX(wr.created_date) FROM workflow_rows wr WHERE wr.article_id = awc.article_id AND wr.creator_id = awc.creator_id) AS last_date_value,
        (SELECT MIN(wr.created_date) FROM workflow_rows wr WHERE wr.article_id = awc.article_id AND wr.creator_id = awc.creator_id) AS first_date_value,
        (SELECT MAX(wr.created_date) FROM workflow_rows wr WHERE wr.article_id = awc.article_id AND wr.creator_id = awc.creator_id - 1) AS previous_date,
        (SELECT MAX(wr.created_date) FROM workflow_rows wr WHERE wr.article_id = awc.article_id AND wr.creator_id = awc.creator_id + 1) AS next_date
    FROM article_workflow_creators awc;
    

    For the single-row JSON input included in the question, this produces:

    +------------+------------+----------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
    | ARTICLE_ID | CREATOR_ID | CREATOR_EMAIL        | LAST_DATE_VALUE               | FIRST_DATE_VALUE              | PREVIOUS_DATE                 | NEXT_DATE                     |
    |------------+------------+----------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------|
    | 123123     |     503081 | AGENT2@SOMETHING.com | 2020-04-30 21:58:09.000 +0000 | 2020-04-30 21:58:09.000 +0000 | 2020-04-30 21:55:09.000 +0000 | NULL                          |
    | 123123     |     503080 | AGENT1@SOMETHING.com | 2020-04-30 21:55:09.000 +0000 | 2020-04-30 21:19:09.000 +0000 | 2020-04-30 21:10:09.000 +0000 | 2020-04-30 21:58:09.000 +0000 |
    | 123123     |     503079 | AGENT0@SOMETHING.com | 2020-04-30 21:10:09.000 +0000 | 2020-04-30 21:08:09.000 +0000 | NULL                          | 2020-04-30 21:55:09.000 +0000 |
    +------------+------------+----------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
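The query above finds the neighbouring agents with creator_id - 1 and creator_id + 1, which only works when creator ids are consecutive. A possible alternative, sketched here under two assumptions, is to aggregate per creator first and then apply LAG and LEAD across the per-creator rows in time order. The assumptions: each agent's entries are contiguous in time (agents do not interleave), and "previous" and "next" mean the last date of the neighbouring agent, as in the output above. Python's built-in sqlite3 stands in for Snowflake so the sketch is runnable, and workflow_rows is a hand-made table mirroring the CTE of the same name.

```python
import sqlite3

# Flattened rows from the question: (article_id, creator_id, created_date).
# SQLite replaces Snowflake here purely so the example can be run locally.
rows = [
    ("123123", 503081, "2020-04-30T21:58:09Z"),
    ("123123", 503080, "2020-04-30T21:55:09Z"),
    ("123123", 503080, "2020-04-30T21:19:09Z"),
    ("123123", 503079, "2020-04-30T21:10:09Z"),
    ("123123", 503079, "2020-04-30T21:08:09Z"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workflow_rows (article_id TEXT, creator_id INTEGER, created_date TEXT)"
)
conn.executemany("INSERT INTO workflow_rows VALUES (?, ?, ?)", rows)

query = """
WITH per_creator AS (
    -- step 1: one row per (article, creator) with its first and last date
    SELECT article_id, creator_id,
           MIN(created_date) AS first_date_value,
           MAX(created_date) AS last_date_value
    FROM workflow_rows
    GROUP BY article_id, creator_id
)
-- step 2: look at the neighbouring creators in time order; no id arithmetic
SELECT article_id, creator_id, last_date_value, first_date_value,
       LAG(last_date_value)  OVER (PARTITION BY article_id ORDER BY first_date_value) AS previous_date,
       LEAD(last_date_value) OVER (PARTITION BY article_id ORDER BY first_date_value) AS next_date
FROM per_creator
ORDER BY creator_id DESC
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

With the question's data this yields the same previous and next dates as the table above, while never relying on the ids being adjacent.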
    

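    【Comments】:

    • Thanks!! I wasn't clear enough about the n ± 1 agents (in the real dataset CREATOR_ID is not sequential). What I was trying to ask was how to get the First_Value, the Last_Value, the MAX of Lead and the MIN of Lag. But I tried the syntax you suggested and it worked: SELECT cols, aggregate(cols) FROM (SELECT cols, window(cols)) GROUP BY cols
    【Answer 2】:

    I'm sharing the code that shows how I used the recommended syntax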

    WITH WorkFlow_Parsed AS(
      
    SELECT ARTICLE_ID,
           HARVEST_DATE,
           value:createdDate::timestamp_tz  AS create_date,                
           value:creator:email AS email,
           value:workflowType AS  workflowType,
           value:value AS value
          
    FROM 'QWERTY', lateral flatten( input => REQUEST:workflowHistory )
    )
    
    SELECT _ARTICLE_ID, _email, last_date_value,first_date_value,
    MIN(lag_value),
    MAX(lead_value)
    FROM (
    SELECT 
    DISTINCT 
    WorkFlow_Parsed.ARTICLE_ID AS _ARTICLE_ID,
    WorkFlow_Parsed.email AS _email,
    last_value(WorkFlow_Parsed.create_date) over (partition by WorkFlow_Parsed.email,WorkFlow_Parsed.ARTICLE_ID order by WorkFlow_Parsed.create_date) AS last_date_value,
    first_value(WorkFlow_Parsed.create_date) over (partition by WorkFlow_Parsed.email,WorkFlow_Parsed.ARTICLE_ID order by WorkFlow_Parsed.create_date) AS first_date_value,
    COALESCE(LAG(WorkFlow_Parsed.create_date) IGNORE NULLS over (partition by  WorkFlow_Parsed.ARTICLE_ID order by WorkFlow_Parsed.create_date),'1900-01-01 00:00:00') AS lag_value,
    COALESCE(LEAD(WorkFlow_Parsed.create_date) IGNORE NULLS over (partition by WorkFlow_Parsed.ARTICLE_ID order by WorkFlow_Parsed.create_date),'2100-01-01 00:00:00') AS lead_value
    FROM  WorkFlow_Parsed) GROUP BY _ARTICLE_ID,_email,last_date_value,first_date_value
    
