【Question Title】: Getting duplication with PARTITION BY
【Posted】: 2023-03-21 20:29:02
【Question】:

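I want to know which customers who ordered in June 2020 also ordered in June 2021. My code returns the correct DISTINCT order counts, but the discounted sales are wrong for customers who placed more than one order in either year. For example, one customer placed one order in 2020 and four orders in 2021, and the 2020 discounted sales came out at 4x the actual amount. The four 2021 orders produce four rows, and the single 2020 order is repeated on each of them. I saw this by using ROW_NUMBER(), which exposed the underlying problem. I can't use DISTINCT on discounted sales, because customers really do place multiple orders for the same dollar amount. How can I get the exact discounted sales using BQ's standard SQL?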

SELECT 
DISTINCT ly.cuid AS cuid,
COUNT(DISTINCT ly.order_id) OVER (PARTITION BY ly.cuid) AS ly_orders,
SUM(ly.discounted_sales) OVER (PARTITION BY ly.cuid) AS ly_demand,
COUNT(DISTINCT ty.order_id) OVER (PARTITION BY ty.cuid) AS ty_orders,
SUM(ty.discounted_sales) OVER (PARTITION BY ly.cuid) AS ty_demand
    
    FROM table ly

        LEFT JOIN table ty
        ON ly.cuid = ty.cuid

        WHERE ly.order_date BETWEEN '2020-06-01' AND '2020-06-30'
        AND ty.order_date BETWEEN '2021-06-01'AND '2021-06-30'
        AND ly.financial_status <> 'credit'
        AND ty.financial_status <> 'credit'
        AND ly.discounted_sales >0
        AND ty.discounted_sales >0
        AND ly.channel = 'b2b'
        AND ty.channel = 'b2b'
        ORDER BY ly.cuid asc

[Results]

cuid    ly_orders    ly_demand  ty_orders    ty_demand  comments
D       1            22,466.40  4            154,596.24 ly is 4x actual
F       2             2,573.20  1              1,944.40 ty is 2x actual
G       1            32,134.40  4              1,632.00 ly is 4x actual
I       2               757.56  1                730.56 ty is 2x actual
J       2            54,859.00  2             23,822.32 both are 2x actual

This works:

WITH prior_period AS (
SELECT 
DISTINCT cuid AS ly_cuid,
COUNT(DISTINCT order_id) OVER (PARTITION BY cuid) AS ly_orders,
SUM(discounted_sales) OVER (PARTITION BY cuid) AS ly_demand
    FROM TABLE 
        WHERE EXTRACT (YEAR FROM order_date) = 2020 AND EXTRACT(MONTH FROM order_date) = 6
        AND financial_status <> 'credit'
        AND discounted_sales >0
        AND channel = 'b2b'
        GROUP BY cuid, order_id, discounted_sales
        ORDER BY cuid asc),

    this_period AS (
    SELECT 
    DISTINCT cuid AS ty_cuid,
    COUNT(DISTINCT order_id) OVER (PARTITION BY cuid) AS ty_orders,
    SUM(discounted_sales) OVER (PARTITION BY cuid) AS ty_demand
    FROM TABLE 
        WHERE EXTRACT (YEAR FROM order_date) = 2021 AND EXTRACT(MONTH FROM order_date) = 6
        AND financial_status <> 'credit'
        AND discounted_sales >0
        AND channel = 'b2b'
        GROUP BY cuid, order_id, discounted_sales
        ORDER BY cuid asc)

        SELECT *
        FROM prior_period ly
        JOIN this_period ty ON ly.ly_cuid = ty.ty_cuid
        ORDER BY ly.ly_cuid 

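【Comments】:

  • It would help to see the data and the expected results. Basically, you need to aggregate each side of the JOIN first, then join the aggregated data. Otherwise, the join causes your SUMs / COUNTs to be affected by rows from the other table.
  • Thank you, Jon. I'll do my best to do that. I only started learning SQL a few weeks ago. I'm not yet allowed to embed photos, and as you can now see, the results turned into a mess when I pasted them.
  • Exactly. Aggregate before joining. But you've missed some other details; for example, if you use GROUP BY, you don't need DISTINCT. Window functions are good when you don't want to collapse each group/partition into a single row. If you look at my solution, I avoid that and use GROUP BY. With my approach you can simplify things a bit.
  • If you can provide some test data, that might help. Your use of GROUP BY may be wrong, or at least partly unnecessary, when combined with window functions.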

Tags: sql google-bigquery case partition


【Solution 1】:

Updated for your schema and approximate data:

Try this...

WITH periods AS (
      SELECT cuid     AS cuid
           , COUNT(*) AS orders
           , SUM(discounted_sales) AS demand
           , EXTRACT(YEAR FROM order_date) AS yr
        FROM demand2
       WHERE EXTRACT(YEAR FROM order_date) IN (2020, 2021) AND EXTRACT(MONTH FROM order_date) = 6
         AND financial_status <> 'credit'
         AND discounted_sales > 0
         AND channel = 'b2b'
       GROUP BY cuid, EXTRACT(YEAR FROM order_date)
     )
SELECT ly.cuid
     , ly.orders AS ly_orders
     , ly.demand AS ly_demand
     , ty.orders AS ty_orders
     , ty.demand AS ty_demand
  FROM periods AS ly
  JOIN periods AS ty
    ON ly.cuid = ty.cuid
   AND ly.yr = 2020
   AND ty.yr = 2021
 ORDER BY ly.cuid
;

Result:

+------+-----------+-----------+-----------+-----------+
| cuid | ly_orders | ly_demand | ty_orders | ty_demand |
+------+-----------+-----------+-----------+-----------+
| D    |         1 |   5616.60 |         4 | 154596.24 |
| F    |         2 |   2573.20 |         1 |    972.20 |
| G    |         1 |   8033.60 |         4 |   1632.56 |
| I    |         2 |    757.56 |         1 |    365.28 |
| J    |         2 |  27429.50 |         2 |  11911.16 |
+------+-----------+-----------+-----------+-----------+

Here is a similar example, with data, SQL, and results, showing both the incorrect and the correct result.

Data:

SELECT * FROM demand ORDER BY account_id, period;

+----+------------+--------+--------+
| id | account_id | period | demand |
+----+------------+--------+--------+
|  1 |          1 | 202005 |    100 |
|  2 |          1 | 202005 |    120 |
|  3 |          1 | 202105 |    105 |
|  4 |          1 | 202105 |    125 |
|  5 |          1 | 202105 |     30 |
|  6 |          2 | 202005 |    200 |
|  7 |          2 | 202105 |    240 |
+----+------------+--------+--------+

Incorrect SQL, without the SUMs, just to show the join behavior:

SELECT t1.id, t1.account_id, t1.period, t1.demand AS demand1
     , t2.id, t2.period, t2.demand AS demand2
  FROM      demand AS t1
  LEFT JOIN demand AS t2
    ON t1.account_id = t2.account_id
   AND t2.period = 202105
 WHERE t1.period = 202005
 ORDER BY t1.account_id, t1.period, demand1, demand2
;

+----+------------+--------+---------+------+--------+---------+
| id | account_id | period | demand1 | id   | period | demand2 |
+----+------------+--------+---------+------+--------+---------+
|  1 |          1 | 202005 |     100 |    5 | 202105 |      30 |
|  1 |          1 | 202005 |     100 |    3 | 202105 |     105 |
|  1 |          1 | 202005 |     100 |    4 | 202105 |     125 |
|  2 |          1 | 202005 |     120 |    5 | 202105 |      30 |
|  2 |          1 | 202005 |     120 |    3 | 202105 |     105 |
|  2 |          1 | 202005 |     120 |    4 | 202105 |     125 |
|  6 |          2 | 202005 |     200 |    7 | 202105 |     240 |
+----+------------+--------+---------+------+--------+---------+

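Note that account 2 has no problem, because only one demand row was found for last year and one for this year.

But account 1 has 2 demand rows for last year and 3 demand rows for this year, which produces (2 x 3) = 6 rows in the join result. That is the root of the problem.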
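To see what that fan-out does to the totals, here is a minimal sketch that puts the SUMs back on top of the same join. It is an illustration, not BigQuery: it assumes Python's built-in sqlite3 module as a stand-in engine and reloads the sample demand rows shown above; the joined_rows column and the inflated variable are names chosen here for illustration.

```python
import sqlite3

# In-memory stand-in for the sample `demand` table (SQLite, not BigQuery).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demand (id INTEGER, account_id INTEGER, period INTEGER, demand INTEGER)"
)
conn.executemany(
    "INSERT INTO demand VALUES (?, ?, ?, ?)",
    [
        (1, 1, 202005, 100), (2, 1, 202005, 120),
        (3, 1, 202105, 105), (4, 1, 202105, 125), (5, 1, 202105, 30),
        (6, 2, 202005, 200), (7, 2, 202105, 240),
    ],
)

# Join first, aggregate afterwards: every 2020 row is counted once per
# matching 2021 row, and vice versa, so both sums are inflated.
inflated = conn.execute("""
    SELECT t1.account_id
         , COUNT(*)       AS joined_rows
         , SUM(t1.demand) AS ly_demand
         , SUM(t2.demand) AS ty_demand
      FROM      demand AS t1
      LEFT JOIN demand AS t2
        ON t1.account_id = t2.account_id
       AND t2.period = 202105
     WHERE t1.period = 202005
     GROUP BY t1.account_id
     ORDER BY t1.account_id
""").fetchall()

print(inflated)  # [(1, 6, 660, 520), (2, 1, 200, 240)]
```

For account 1 the true totals are 220 and 260; the join reports 660 (3x, once per 2021 row) and 520 (2x, once per 2020 row). Account 2 is unaffected because it has one row on each side.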

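To correct this, we aggregate before the JOIN, so that at most one (1) row per account/period can be joined.

One form of a correct solution, with the SUMs derived in a CTE term: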

WITH demands AS (
         SELECT account_id, period
              , SUM(demand) AS demand
              , COUNT(*)    AS orders
           FROM demand
          GROUP BY account_id, period
     )
SELECT ly.account_id, ly.period
     , ly.orders AS ly_orders
     , ly.demand AS ly_demand
     , ty.orders AS ty_orders
     , ty.demand AS ty_demand
  FROM      demands AS ly
  LEFT JOIN demands AS ty
    ON ly.account_id = ty.account_id
   AND ty.period = 202105
 WHERE ly.period = 202005
 ORDER BY ly.account_id, ly.period, ly_demand, ty_demand
;

Result:

+------------+--------+-----------+-----------+-----------+-----------+
| account_id | period | ly_orders | ly_demand | ty_orders | ty_demand |
+------------+--------+-----------+-----------+-----------+-----------+
|          1 | 202005 |         2 |       220 |         3 |       260 |
|          2 | 202005 |         1 |       200 |         1 |       240 |
+------------+--------+-----------+-----------+-----------+-----------+

Because we performed the aggregation in the CTE term (demands), at most one row can be found per period for each account.
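One way to gain confidence in a query like this is to reconcile its output against per-period totals computed with no join at all. The sketch below does that for the sample data. It again assumes Python's built-in sqlite3 module as a stand-in for BigQuery, and the joined and totals variables are illustrative names, not part of the original answer.

```python
import sqlite3

# In-memory stand-in for the sample `demand` table (SQLite, not BigQuery).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demand (id INTEGER, account_id INTEGER, period INTEGER, demand INTEGER)"
)
conn.executemany(
    "INSERT INTO demand VALUES (?, ?, ?, ?)",
    [
        (1, 1, 202005, 100), (2, 1, 202005, 120),
        (3, 1, 202105, 105), (4, 1, 202105, 125), (5, 1, 202105, 30),
        (6, 2, 202005, 200), (7, 2, 202105, 240),
    ],
)

# Aggregate first (one row per account/period), then self-join.
joined = conn.execute("""
    WITH demands AS (
        SELECT account_id, period
             , SUM(demand) AS demand
             , COUNT(*)    AS orders
          FROM demand
         GROUP BY account_id, period
    )
    SELECT ly.account_id, ly.orders, ly.demand, ty.orders, ty.demand
      FROM      demands AS ly
      LEFT JOIN demands AS ty
        ON ly.account_id = ty.account_id
       AND ty.period = 202105
     WHERE ly.period = 202005
     ORDER BY ly.account_id
""").fetchall()

# Independent per-period totals, computed without any join.
totals = {
    (account_id, period): total
    for account_id, period, total in conn.execute(
        "SELECT account_id, period, SUM(demand) FROM demand GROUP BY account_id, period"
    )
}

# Each joined figure must equal the matching un-joined total.
for account_id, ly_orders, ly_demand, ty_orders, ty_demand in joined:
    assert ly_demand == totals.get((account_id, 202005))
    assert ty_demand == totals.get((account_id, 202105))

print(joined)  # [(1, 2, 220, 3, 260), (2, 1, 200, 1, 240)]
```

If the join were fanning out, the reconciliation loop would fail, since the joined sums would be multiples of the un-joined totals.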
