【Question title】: Left join lateral for conditional sums
【Posted】: 2019-07-18 00:50:15
【Question】:

I have a purchase dataset with customers, products and categories.

customer     product     category    sales_value
       A     aerosol     air_care             10
       B     aerosol     air_care             12
       C     aerosol     air_care              7
       A     perfume     air_care              8
       A     perfume     air_care              2
       D     perfume     air_care             11
       C      burger         food             13
       D       fries         food              6
       C       fries         food              9
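
To make the examples reproducible, the rows above could be loaded with a script along these lines. The table name transactions matches the queries in this thread; the column types are assumptions, since the post does not state them.

```sql
-- Assumed schema: the post only shows column names, not types.
CREATE TABLE transactions (
  customer    text           NOT NULL,
  product     text           NOT NULL,
  category    text           NOT NULL,
  sales_value numeric(10, 2) NOT NULL
);

-- The nine sample rows from the table above.
INSERT INTO transactions (customer, product, category, sales_value) VALUES
  ('A', 'aerosol', 'air_care', 10),
  ('B', 'aerosol', 'air_care', 12),
  ('C', 'aerosol', 'air_care',  7),
  ('A', 'perfume', 'air_care',  8),
  ('A', 'perfume', 'air_care',  2),
  ('D', 'perfume', 'air_care', 11),
  ('C', 'burger',  'food',     13),
  ('D', 'fries',   'food',      6),
  ('C', 'fries',   'food',      9);
```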

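For each product, I'd like the ratio between the sales value spent on that product and the sales value spent on that product's category, computed over the customers who bought the product at least once.

Another way to say it: take the customers who bought fries at least once and, for all of them, compute A) the sum of sales value spent on fries and B) the sum of sales value spent on food.

An intermediate table would have the following form: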

product    category  sum_spent_on_product           sum_spent_on_category    ratio
                                                 by_people_buying_product
aerosol    air_care                    29                              39     0.74
perfume    air_care                    21                              31     0.68
 burger        food                    13                              22     0.59
  fries        food                    15                              28     0.53

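Example: people who bought aerosol at least once spent a total of 1800 on this product. The same people, overall, spent 3600 on the air_care category (which aerosol belongs to). Thus, the ratio for aerosol is 0.5.

I tried to solve this with a left join lateral, computing the intermediate results for each product, but I can't figure out how to include the condition "only for customers who bought this specific product":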

select
    distinct (product_id)
  , category
  , c.sales_category
from transactions t
left join lateral (
  select
    sum(sales_value) as sales_category
  from transactions
  where category = t.category
  group by category
) c on true
;

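The query above lists, for each product, the sum spent on the product's category, but without the required condition on the product's buyers.

Is left join lateral the right approach? Is there another solution in plain SQL?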

【Comments】:

  • I can't see how your second sample data table relates to the first one.

Tags: sql postgresql lateral-join


【Solution 1】:

I would use a window function to compute the total each customer spent in each category:

SELECT
  customer, product, category, sales_value,
  sum(sales_value) OVER (PARTITION BY customer, category) AS tot_cat
FROM transactions;

 customer | product | category | sales_value | tot_cat 
----------+---------+----------+-------------+---------
 A        | aerosol | air_care |       10.00 |   20.00
 A        | perfume | air_care |        8.00 |   20.00
 A        | perfume | air_care |        2.00 |   20.00
 B        | aerosol | air_care |       12.00 |   12.00
 C        | aerosol | air_care |        7.00 |    7.00
 C        | fries   | food     |        9.00 |   22.00
 C        | burger  | food     |       13.00 |   22.00
 D        | perfume | air_care |       11.00 |   11.00
 D        | fries   | food     |        6.00 |    6.00

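Then we just need to aggregate. There is a problem when a customer bought the same product several times: in your example, customer A bought perfume twice. To get around this, let's group by customer, product and category at the same time (and sum the sales_value column):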

SELECT
  customer, product, category, SUM(sales_value) AS sales_value,
  SUM(SUM(sales_value)) OVER (PARTITION BY customer, category) AS tot_cat
FROM transactions
GROUP BY customer, product, category;

 customer | product | category | sales_value | tot_cat 
----------+---------+----------+-------------+---------
 A        | aerosol | air_care |       10.00 |   20.00
 A        | perfume | air_care |       10.00 |   20.00 <-- this row summarizes rows 2 and 3 of previous result
 B        | aerosol | air_care |       12.00 |   12.00
 C        | aerosol | air_care |        7.00 |    7.00
 C        | burger  | food     |       13.00 |   22.00
 C        | fries   | food     |        9.00 |   22.00
 D        | perfume | air_care |       11.00 |   11.00
 D        | fries   | food     |        6.00 |    6.00

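Now we just have to sum sales_value and tot_cat per product to get your intermediate result table. I use a common table expression to make the previous result available under the name t: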

WITH t AS (
  SELECT
    customer, product, category, SUM(sales_value) AS sales_value,
    SUM(SUM(sales_value)) OVER (PARTITION BY customer, category) AS tot_cat
  FROM transactions
  GROUP BY customer, product, category
)
SELECT
  product, category,
  sum(sales_value) AS sales_value, sum(tot_cat) AS tot_cat,
  sum(sales_value) / sum(tot_cat) AS ratio
FROM t
GROUP BY product, category;

 product | category | sales_value | tot_cat |         ratio          
---------+----------+-------------+---------+------------------------
 aerosol | air_care |       29.00 |   39.00 | 0.74358974358974358974
 fries   | food     |       15.00 |   28.00 | 0.53571428571428571429
 burger  | food     |       13.00 |   22.00 | 0.59090909090909090909
 perfume | air_care |       21.00 |   31.00 | 0.67741935483870967742
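
For comparison with the approach the question asks about, the same numbers can also be obtained with LEFT JOIN LATERAL by correlating the subquery on both customer and category. The following is a minimal sketch, not a tuned implementation: the transactions CTE only inlines the sample rows from the question so that the statement runs on its own, and the names per_customer, c and sales_category are chosen for illustration.

```sql
WITH transactions (customer, product, category, sales_value) AS (
  -- Sample rows from the question, inlined so the query is self-contained.
  VALUES
    ('A', 'aerosol', 'air_care', 10.00),
    ('B', 'aerosol', 'air_care', 12.00),
    ('C', 'aerosol', 'air_care',  7.00),
    ('A', 'perfume', 'air_care',  8.00),
    ('A', 'perfume', 'air_care',  2.00),
    ('D', 'perfume', 'air_care', 11.00),
    ('C', 'burger',  'food',     13.00),
    ('D', 'fries',   'food',      6.00),
    ('C', 'fries',   'food',      9.00)
),
per_customer AS (
  -- One row per (customer, product), so repeat purchases are not double counted.
  SELECT customer, product, category, SUM(sales_value) AS sales_value
  FROM transactions
  GROUP BY customer, product, category
)
SELECT
  p.product,
  p.category,
  SUM(p.sales_value)    AS sum_spent_on_product,
  SUM(c.sales_category) AS sum_spent_on_category,
  SUM(p.sales_value) / SUM(c.sales_category) AS ratio
FROM per_customer p
LEFT JOIN LATERAL (
  -- Category total for this buyer only: this is the "customers who bought
  -- this specific product" condition the question was missing.
  SELECT SUM(t.sales_value) AS sales_category
  FROM transactions t
  WHERE t.customer = p.customer
    AND t.category = p.category
) c ON true
GROUP BY p.product, p.category
ORDER BY p.category, p.product;
```

On the sample data this should give the same sums as the table above (29/39, 21/31, 13/22 and 15/28). The lateral subquery is evaluated once per row of per_customer, which may explain why it tends to be slower than the window-function version on large tables.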

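【Comments】:

  • Your solution is actually a million times faster than the lateral join solution, especially if there are several complex where clauses along the way.
  • Nice, I wasn't aware of how this solution performs. Thanks for the feedback!
【Solution 2】:

For each product, I'd like the ratio between the sales value spent on that product and the sales value spent on that product's category, computed over the customers who bought the product at least once.

If I understand correctly, you can summarize sales by person and category to get the total for the category. In Postgres, you can keep an array of the products and use it for matching. So, the query looks like: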

select p.product, p.category,
       sum(p.sales_value) as product_only_sales, 
       sum(pp.sales_value) as comparable_sales
from purchases p join
     (select customer, category, array_agg(distinct product) as products, sum(sales_value) as sales_value
      from purchases p
      group by customer, category
     ) pp
     on p.customer = pp.customer and p.category = pp.category and p.product = any (pp.products)
group by p.product, p.category;

Here is a dbfiddle.

EDIT:

The data allows duplicates on a date for a product. That throws things off. A solution is to pre-aggregate by product for each customer:

select p.product, p.category, sum(p.sales_value) as product_only_sales, sum(pp.sales_value) as comparable_sales
from (select customer, category, product, sum(sales_value) as sales_value
      from purchases p
      group by customer, category, product
     ) p join
     (select customer, category, array_agg(distinct product) as products, sum(sales_value) as sales_value
      from purchases p
      group by customer, category
     ) pp
     on p.customer = pp.customer and p.category = pp.category and p.product = any (pp.products)
group by p.product, p.category;

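Here is a dbfiddle for this example.

【Comments】:

  • That's really interesting. However, with real data (about 2.5 million rows) it takes about 100 seconds to compute, even if we focus on only a small subset of products. I wonder whether there is a way to aggregate pp not by customer -> product but the other way around, to improve performance. The order of this aggregation should probably be dictated by the nature of the filters we want (say, whether we want this for a specific set of products or for a specific set of customers). Anyway, thanks, the idea is quite helpful.
  • @Jivan . . . That seems like a long time. How long does the subquery with the arrays take? An index on customer, category, product might be helpful.
  • @unutbu . . . Good catch. That is caused by duplicates on customer and individual products in the table.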
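
The index and the timing check mentioned in the comments above could be tried roughly as follows. This is only a sketch: the column types and the index name are assumptions, and whether the planner uses the index depends on the real data and filters.

```sql
-- Hypothetical table definition; the thread does not show the real schema.
CREATE TABLE IF NOT EXISTS purchases (
  customer    text,
  product     text,
  category    text,
  sales_value numeric(10, 2)
);

-- Index on the columns suggested in the comment; the name is made up here.
CREATE INDEX IF NOT EXISTS purchases_customer_category_product_idx
  ON purchases (customer, category, product);

-- Time the array subquery on its own to see how much of the total it accounts for.
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer, category,
       array_agg(DISTINCT product) AS products,
       sum(sales_value) AS sales_value
FROM purchases
GROUP BY customer, category;
```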