SQL: order by before group by (or join alternative) without repeating subquery

【Question title】: SQL: order by before group by (or join alternative) without repeating subquery
【Posted】: 2019-11-14 10:45:00
【Question】:

I am using SQL to create a step funnel report.

It returns rows like the following:

delivered_email,anonymous_id,opened_email,step1_delivered,step2_opened,step3_landing_page,step4_cta_clicked,steps_completed
email1@example.com,,,true,false,false,false,1
email2@example.com,id2,email2@example.com,true,true,true,true,4
email2@example.com,id3,email2@example.com,true,true,false,false,2

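The same email address can have multiple entries, because these people took part in multiple sessions. In this case, however, I am only interested in the session in which each person completed the most steps. For example, the actual result for the case above should have 2 rows instead of 3, with only the steps_completed = 4 row returned for email2@example.com: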

delivered_email,anonymous_id,opened_email,step1_delivered,step2_opened,step3_landing_page,step4_cta_clicked,steps_completed
email1@example.com,,,true,false,false,false,1
email2@example.com,id2,email2@example.com,true,true,true,true,4

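Normally this can be done by joining the result with the max(steps_completed) for each user, as described on Stack Overflow. In my case, however, the steps_completed column is itself calculated as part of another subquery. Creating a join on it would therefore require me to copy-paste the entire subquery, which would be unmaintainable.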
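For reference, the join-on-max pattern mentioned above might look roughly like the sketch below. It assumes BigQuery standard SQL; the CTE names `funnel` and `best` are hypothetical, and the inline rows are copied from the sample output above in place of the real funnel subquery.

```sql
-- Hypothetical CTE `funnel`: inline rows standing in for the real funnel subquery.
with funnel as (
  select
    'email1@example.com' as delivered_email,
    cast(null as string) as anonymous_id,
    1 as steps_completed
  union all
  select 'email2@example.com', 'id2', 4
  union all
  select 'email2@example.com', 'id3', 2
),

-- Highest step count per email address.
best as (
  select
    delivered_email,
    max(steps_completed) as max_steps
  from funnel
  group by delivered_email
)

-- Keep only the rows that reach the per-email maximum.
select f.*
from funnel as f
join best as b
  on  f.delivered_email = b.delivered_email
  and f.steps_completed = b.max_steps
order by f.delivered_email
```

Because `funnel` is named once in a `with` clause, it can be referenced twice without being pasted twice. Note that this pattern returns every tied row when two sessions for the same email share the maximum step count.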

This is the query:

select
  *

from
(
  -- Counts for each session how many steps were completed
  -- This can be used to only select the session with the most steps completed for each unique email address
  select
    *,
    if(step1_delivered, 1, 0) +
    if(step2_opened, 1, 0) +
    if(step3_landing_page, 1, 0) +
    if(step4_cta_clicked, 1, 0)
      as steps_completed

  from
  (
    -- Below subquery combines email addresses with associated anonymous_ids
    -- Note that a single email may have multiple entries here if they used multiple devices
    -- In the rest of the funnel we are interested only in the case grouped by email with the most steps completed
    select
      t_delivered.email as delivered_email,
      t_identifies.id as anonymous_id,
      t_opened.email as opened_email,
      t_delivered.email is not null as step1_delivered,
      coalesce(t_opened.email, t_identifies.id) is not null as step2_opened,
      t_landing_page.id is not null as step3_landing_page,
      t_cta_clicked.id is not null as step4_cta_clicked

    -- Step 1: Retrieve emails to which opener was sent
    from
    (
      select context_traits_email as email

      from drip.email_delivered

      where email_subject like '%you are invited%'

      group by email
    ) as t_delivered

    -- Retrieve the anonymous_id for each email, if set (i.e. if identified)
    -- Note that if we have identified a user we will assume they have opened the email
    left join
    (
      select
        email,
        anonymous_id as id

      from javascript.identifies

      group by email, anonymous_id
    ) as t_identifies

    on t_identifies.email = t_delivered.email

    -- Step 2: retrieve which email addresses opened the opener email
    left join
    (
      select context_traits_email as email
      from drip.email_opened
      group by email
    ) as t_opened

    on t_opened.email = t_delivered.email

    -- Step 3: landing page visited
    left join
    (
      select anonymous_id as id
      from javascript.pages
      where context_page_title = 'Homepage'
      group by anonymous_id
    ) as t_landing_page

    on t_landing_page.id = t_identifies.id

    -- Step 4: CTA clicked
    left join
    (
      select anonymous_id as id
      from javascript.dtc_file_selection_initiated
      group by anonymous_id
    ) as t_cta_clicked

    on t_cta_clicked.id = t_identifies.id
  )
)

How can I group this result by delivered_email, with the result (before grouping) ordered by steps_completed (desc), without repeating my subquery?

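【Question comments】:

Could you turn your subquery into a view and then join to that view where needed?
@alexherm That works, although it would require me to maintain a separate view. Do you think that is the only way?
I'm sure there is another way. But if this works, go with it. By maintenance, do you mean setting it up once, or does it need regular updating?
@alexherm I would need to update it regularly. There are some subqueries in there that query user segments etc., and I need to change them depending on the segment I am interested in.
A minimal reproducible example, please. PS What is your question? What does "group the result by delivered_email while the result (before grouping) is ordered by steps_completed" mean? Tables have no order, so ordering before group by has no effect without limit/top, but that is clearly not what you want, so what do you want? Use enough words and references to parts of the example. "Basically" also just means "unclearly" when it does not introduce or summarize a full explanation. PS It is also not clear what this post and its code have to do with "another subquery" and avoiding "a join on it". PS Clarify via edits, not comments.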

Tags: sql join google-bigquery

【Solution 1】:

You should try using a CTE (aka "with clause") together with numbering window functions:

with

t_delivered as (
    select distinct
        context_traits_email as email
    from
        drip.email_delivered
    where
        email_subject like '%you are invited%'
),

t_identifies as (
    -- Retrieve the anonymous_id for each email, if set (i.e. if identified)
    -- Note that if we have identified a user we will assume they have opened the email
    select distinct
        email,
        anonymous_id as id
    from
        `javascript.identifies`
),

t_opened as (
    -- Step 2: retrieve which email addresses opened the opener email
    select distinct
        context_traits_email as email
    from
        `drip.email_opened`
),

t_landing_page as (
    -- Step 3: landing page visited
    select distinct
        anonymous_id as id
    from
        `javascript.pages`
    where
        context_page_title = 'Homepage'
),

t_cta_clicked as (
    -- Step 4: CTA clicked
    select distinct
        anonymous_id as id
    from
        `javascript.dtc_file_selection_initiated`
),

total_data as (
    -- Below subquery combines email addresses with associated anonymous_ids
    -- Note that a single email may have multiple entries here if they used multiple devices
    -- In the rest of the funnel we are interested only in the case grouped by email with the most steps completed
    select
        td.email as delivered_email,
        ti.id as anonymous_id,
        t_op.email as opened_email,  -- alias is t_op because `to` is a reserved keyword in BigQuery
        td.email is not null as step1_delivered,
        coalesce(ti.id, t_op.email) is not null as step2_opened,
        tlp.id is not null as step3_landing_page,
        tcc.id is not null as step4_cta_clicked
    from
        t_delivered as td
        left join t_identifies as ti on td.email = ti.email
        left join t_opened as t_op on td.email = t_op.email
        left join t_landing_page as tlp on ti.id = tlp.id
        left join t_cta_clicked as tcc on ti.id = tcc.id
)

select
    *
from
    -- Counts for each session how many steps were completed
    -- This can be used to only select the session with the most steps completed for each unique email address
    (   select
            *,
            row_number() over(  partition by
                                    delivered_email
                                order by  -- prioritize columns here
                                    steps_completed desc,
                                    step4_cta_clicked desc,
                                    step3_landing_page desc,
                                    step2_opened desc,
                                    step1_delivered desc,
                                    anonymous_id) as rn
        from
            (   select
                    *,
                    if(step1_delivered, 1, 0)
                    + if(step2_opened, 1, 0)
                    + if(step3_landing_page, 1, 0)
                    + if(step4_cta_clicked, 1, 0) as steps_completed
                from
                    total_data
                )
        )
where
    rn = 1
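To try the `row_number()` technique without access to the real tables, a minimal self-contained sketch might look like the following. It assumes BigQuery standard SQL; the CTE names `funnel`, `scored` and `ranked` are hypothetical, and the inline rows are taken from the sample output in the question.

```sql
-- Hypothetical inline rows standing in for the total_data CTE above.
with funnel as (
  select
    'email1@example.com' as delivered_email,
    cast(null as string) as anonymous_id,
    true  as step1_delivered,
    false as step2_opened,
    false as step3_landing_page,
    false as step4_cta_clicked
  union all
  select 'email2@example.com', 'id2', true, true, true, true
  union all
  select 'email2@example.com', 'id3', true, true, false, false
),

-- Count how many steps each session completed.
scored as (
  select
    *,
    if(step1_delivered, 1, 0)
    + if(step2_opened, 1, 0)
    + if(step3_landing_page, 1, 0)
    + if(step4_cta_clicked, 1, 0) as steps_completed
  from funnel
),

-- Number the sessions of each email address, best session first.
-- anonymous_id is only a tie-breaker so that the numbering is deterministic.
ranked as (
  select
    *,
    row_number() over (
      partition by delivered_email
      order by steps_completed desc, anonymous_id
    ) as rn
  from scored
)

-- Keep the best session per email and hide the helper column.
select * except (rn)
from ranked
where rn = 1
order by delivered_email
```

With these three input rows, the query should return one row per email address: the single session for email1@example.com and the id2 session for email2@example.com. Unlike the join-on-max approach, `row_number()` always keeps exactly one row per partition, even when sessions tie. BigQuery also has a `qualify` clause for filtering on window functions, which may allow the `ranked` step and the `rn = 1` filter to be merged; check the current documentation for its exact requirements.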

【Comments】: