Workaround for multiple rollups: answers

【Question title】: Workaround for multiple rollups
【Posted】: 2021-12-18 21:45:32
【Question】:

Is there a way to accomplish the following in BigQuery? This syntax is supported in databases such as Postgres:

SELECT ProductGroup, Product, Year, Month, AVG(Revenue) 
FROM Sales
group by rollup(ProductGroup, Product), rollup(Year, Month)

In other words, I want the cross product of the two rollups:

ROLLUP(ProductGroup, Product) --> (), (ProductGroup), (ProductGroup, Product)
ROLLUP(Year, Month) --> (), (Year), (Year, Month)

((), (ProductGroup), (ProductGroup, Product)) x ((), (Year), (Year, Month))
= (
    (), (ProductGroup), (ProductGroup, Product),
    (Year), (Year, ProductGroup), (Year, ProductGroup, Product),
     (Year, Month), (Year, Month, ProductGroup), (Year, Month, ProductGroup, Product)
)
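
The expansion above can be checked mechanically. The following is a minimal Python sketch, not BigQuery code: the helper name `rollup` is made up for illustration, and it assumes a rollup is simply the list of prefixes of its column list. Column order inside a grouping set does not matter, so `(ProductGroup, Year)` here is the same set as `(Year, ProductGroup)` above.

```python
from itertools import product

def rollup(*cols):
    """Grouping sets of ROLLUP(cols): every prefix, from () up to the full list."""
    return [cols[:i] for i in range(len(cols) + 1)]

rows_axis = rollup("ProductGroup", "Product")
cols_axis = rollup("Year", "Month")

# Cross product: pair one grouping set from each rollup and concatenate them.
grouping_sets = [a + b for a, b in product(rows_axis, cols_axis)]

for gs in grouping_sets:
    print(gs)
```

Two rollups of two columns each give 3 x 3 = 9 grouping sets, matching the list written out by hand.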

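When I try this in BQ I get the following error:

The GROUP BY clause only supports ROLLUP when there are no other grouping elements at [2:10]

Here is an update with some example images and data.

First, I want to replicate the functionality of an Excel pivot table. This is where the cross product of the ROWS and COLS rollups comes into play (pivot table screenshot not preserved):

Note that the pivot table has 63 value cells.

Now, the correct result in SQL, using the verbose GROUP BY-only syntax, is as follows (result screenshot not preserved):

Note that this also yields exactly 63 rows (since we have one value column -- SUM of Revenue -- 63 rows x 1 column = 63 value cells). The query is as follows: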

with sales as (
    select 2010 Year, 'Jan' Month, 'Electronics' ProductGroup, 'Phone' Product, 100 Revenue union all
    select 2010,    'Jan',  'Electronics',  'Laptop',   200 union all
    select 2010,    'Jan',  'Cars', 'Jeep', 250 union all
    select 2010,    'Jan',  'Cars', 'Hummer',   105 union all
    select 2010,    'Feb',  'Electronics',  'Phone',    110 union all
    select 2010,    'Feb',  'Electronics',  'Laptop',   300 union all
    select 2010,    'Feb',  'Cars', 'Jeep', 50 union all
    select 2010,    'Feb',  'Cars', 'Hummer',   75 union all
    select 2010,    'Mar',  'Electronics',  'Phone',    80 union all
    select 2010,    'Mar',  'Electronics',  'Laptop',   200 union all
    select 2010,    'Mar',  'Cars', 'Jeep', 100 union all
    select 2010,    'Mar',  'Cars', 'Hummer',   50 union all
    select 2011,    'Jan',  'Electronics',  'Phone',    200 union all
    select 2011,    'Jan',  'Electronics',  'Laptop',   300 union all
    select 2011,    'Jan',  'Cars', 'Jeep', 100 union all
    select 2011,    'Jan',  'Cars', 'Hummer',   200 union all
    select 2011,    'Feb',  'Electronics',  'Phone',    300 union all
    select 2011,    'Feb',  'Electronics',  'Laptop',   900 union all
    select 2011,    'Feb',  'Cars', 'Jeep', 100 union all
    select 2011,    'Feb',  'Cars', 'Hummer',   200 union all
    select 2011,    'Mar',  'Electronics',  'Phone',    400 union all
    select 2011,    'Mar',  'Electronics',  'Laptop',   350 union all
    select 2011,    'Mar',  'Cars', 'Jeep', 240 union all
    select 2011,    'Mar',  'Cars', 'Hummer',   130
)
-- ROLLUP(ProductGroup, Product), ROLLUP(Year, Month)
--> (), (ProductGroup), (ProductGroup, Product)
--> (Year), (Year, ProductGroup), (Year, ProductGroup, Product)
--> (Year, Month), (Year, Month, ProductGroup), (Year, Month, ProductGroup, Product)

SELECT NULL, NULL, NULL, NULL, AVG(Revenue) FROM Sales UNION ALL                                                -- ()
SELECT ProductGroup, NULL, NULL, NULL, AVG(Revenue) FROM Sales GROUP BY ProductGroup UNION ALL                  -- (ProductGroup)
SELECT ProductGroup, Product, NULL, NULL, AVG(Revenue) FROM Sales GROUP BY ProductGroup, Product UNION ALL      -- (ProductGroup, Product)

SELECT NULL, NULL, Year, NULL, AVG(Revenue) FROM Sales GROUP BY Year UNION ALL                                  -- (Year)
SELECT ProductGroup, NULL, Year, NULL, AVG(Revenue) FROM Sales GROUP BY Year, ProductGroup UNION ALL            -- (Year, ProductGroup)
SELECT ProductGroup, Product, Year, NULL, AVG(Revenue) FROM Sales GROUP BY Year, ProductGroup, Product UNION ALL-- (Year, ProductGroup, Product)

SELECT NULL, NULL, Year, Month, AVG(Revenue) FROM Sales GROUP BY Year, Month UNION ALL                          -- (Year, Month)
SELECT ProductGroup, NULL, Year, Month, AVG(Revenue) FROM Sales GROUP BY ProductGroup, Year, Month UNION ALL    -- (ProductGroup, Year, Month)
SELECT ProductGroup, Product, Year, Month, AVG(Revenue) FROM Sales GROUP BY ProductGroup, Product, Year, Month  -- (ProductGroup, Product, Year, Month)
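
The 63-row figure can be reproduced outside the database. This is a rough Python sketch that rebuilds the sample data from the CTE and counts the distinct keys per grouping set, which is what each GROUP BY branch would emit. The names `PRODUCTS`, `REVENUE` and `rollup` are illustrative only.

```python
from itertools import product

PRODUCTS = [("Electronics", "Phone"), ("Electronics", "Laptop"),
            ("Cars", "Jeep"), ("Cars", "Hummer")]
REVENUE = {  # (Year, Month) -> revenue per product, in PRODUCTS order
    (2010, "Jan"): (100, 200, 250, 105),
    (2010, "Feb"): (110, 300, 50, 75),
    (2010, "Mar"): (80, 200, 100, 50),
    (2011, "Jan"): (200, 300, 100, 200),
    (2011, "Feb"): (300, 900, 100, 200),
    (2011, "Mar"): (400, 350, 240, 130),
}
sales = [
    {"ProductGroup": pg, "Product": p, "Year": y, "Month": m, "Revenue": rev}
    for (y, m), revs in REVENUE.items()
    for (pg, p), rev in zip(PRODUCTS, revs)
]

def rollup(*cols):
    """Grouping sets of ROLLUP(cols): every prefix of the column list."""
    return [cols[:i] for i in range(len(cols) + 1)]

grouping_sets = [a + b for a, b in
                 product(rollup("ProductGroup", "Product"), rollup("Year", "Month"))]

# One output row per distinct key within each grouping set, as GROUP BY would give.
rows_per_set = {gs: len({tuple(r[c] for c in gs) for r in sales})
                for gs in grouping_sets}
total_rows = sum(rows_per_set.values())
print(total_rows)  # 63
```

The breakdown is (1 + 2 + 4) row-axis keys times (1 + 2 + 6) column-axis keys, that is 7 x 9 = 63.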

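However, this query is a real nightmare to produce, even when generated programmatically, because there may be an order by, subselects, and so on, and unioning all of those statements can turn into a horrible structure (for example, a 3-row by 3-column structure and a 100-line SQL statement would turn into 4^2 * 100 lines of SQL, and 5x5 would be 5^2 * 100 lines, and so on, if my math is right).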
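
For a sense of what "generated programmatically" might look like, here is a minimal Python sketch that builds the UNION ALL text from a list of rollups. The function `union_all_sql` and the alias `avg_revenue` are hypothetical, the generated SQL has not been run against BigQuery here, and a real generator would also have to carry any subselects or ordering through each branch.

```python
from itertools import product

def rollup(*cols):
    """Grouping sets of ROLLUP(cols): every prefix of the column list."""
    return [cols[:i] for i in range(len(cols) + 1)]

def union_all_sql(table, axes, aggregate):
    """Build one SELECT per grouping set and join them with UNION ALL."""
    all_cols = [c for axis in axes for c in axis]
    selects = []
    for combo in product(*(rollup(*axis) for axis in axes)):
        kept = {c for part in combo for c in part}
        # Columns outside the grouping set are emitted as NULL placeholders.
        select_list = ", ".join(c if c in kept else f"NULL AS {c}" for c in all_cols)
        sql = f"SELECT {select_list}, {aggregate} FROM {table}"
        group_cols = [c for c in all_cols if c in kept]
        if group_cols:
            sql += " GROUP BY " + ", ".join(group_cols)
        selects.append(sql)
    return "\nUNION ALL\n".join(selects)

query = union_all_sql(
    "Sales",
    [("ProductGroup", "Product"), ("Year", "Month")],
    "AVG(Revenue) AS avg_revenue",
)
print(query)
```

A rollup over n columns has n + 1 grouping sets, so two 2-column rollups give 3 x 3 = 9 branches and two 3-column rollups give 4^2 = 16; by that count a 5 x 5 layout would give 6^2 = 36. The size of the generated text grows with that product, which is the blow-up described above.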

So, what is the right way to do this? Note that in a database like Postgres, the following works as-is:

SELECT ProductGroup, Product, Year, Month, AVG(Revenue) FROM Sales GROUP BY ROLLUP(ProductGroup, Product), ROLLUP(Year, Month);

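If you want to use this as a starting point, here is the saved query: https://console.cloud.google.com/bigquery?sq=260144861653:552549d2a81a47b59df6e3d16ef9bf17.

Finally, if you think adding GROUPING SETS and CUBE would be a useful feature, please upvote this feature request: https://issuetracker.google.com/issues/204913323.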

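【Question comments】:

1) Provide sample data and the desired output. 2) Also show us the entire query that gives you the error.
But I don't think syntax like GROUP BY a, ROLLUP(b) is available in BigQuery.
I might be wrong, but the query you presented is too abstract and does not make much sense, so it is hard to help. You could adjust it to make it practical, for example by having all fields in the group by appear in the select, and maybe adding some aggregation. Ideally, provide a simplified example of the input data and the expected output :o)
Wouldn't this give a union rather than a cross product?
@shawnt00 - sorry if there was some misunderstanding going back and forth on different aspects of this post :o)

Tags: sql google-bigquery combinatorics rollup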

【Solution 1】:

I want the cross product of the two rollups:

Consider the following:

select * from (
select date, code 
from `first-outlet-750.tests.parq_stored`
group by rollup(date, code)
), (
select country, state 
from `first-outlet-750.tests.parq_stored`
group by rollup(country, state))

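with the output below (shown as an image in the original post).

【Comments】:

Please see the updated question, with an example and everything.
Does my current answer address your original question, so that you can apply it to your specific use case?
It gives the right number of rows, but how would I pass in the AVG(Revenue)?
Here is a link to the query: console.cloud.google.com/…
I will look into it further soon (at the next available time slot) :o) Meanwhile, it looks like we are halfway there.

【Solution 2】:

Ugly, but probably the simplest way is to have 3 group by statements and union them: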

SELECT ProductGroup,Product,NULL year ,NULL month, AVG(sales.Revenue) avg  
FROM sales 
GROUP BY ROLLUP(ProductGroup,Product)

UNION DISTINCT
SELECT ProductGroup,Product,Year,NULL month, AVG(sales.Revenue) avg  
FROM sales 
GROUP BY ROLLUP(Year, ProductGroup, Product)

UNION DISTINCT 
SELECT ProductGroup,Product,Year,MONTH, AVG(sales.Revenue) avg  
FROM sales 
GROUP BY ROLLUP(Year, Month, ProductGroup, Product)

GCP fiddle
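
To see which grouping sets the three branches above share, here is a small Python sketch. It only models the grouping sets, not the data, and assumes each ROLLUP contributes the prefixes of its column list; `rollup` is an illustrative helper, not a BigQuery function.

```python
from collections import Counter

def rollup(*cols):
    """Grouping sets of ROLLUP(cols), as order-insensitive sets of column names."""
    return [frozenset(cols[:i]) for i in range(len(cols) + 1)]

branches = [
    rollup("ProductGroup", "Product"),
    rollup("Year", "ProductGroup", "Product"),
    rollup("Year", "Month", "ProductGroup", "Product"),
]

counts = Counter(gs for branch in branches for gs in branch)
distinct = set(counts)
repeated = {gs: n for gs, n in counts.items() if n > 1}

print(len(distinct))  # 9
print(repeated)       # the grand total () and the (Year) level
```

The branches produce 3 + 4 + 5 = 12 grouping sets, 9 of them distinct. Under this model the grand total `()` appears in all three branches and `(Year)` appears in the second and third, which suggests UNION DISTINCT is doing real deduplication work in this query.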

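【Comments】:

I think in this case you only need three: ROLLUP(ProductGroup, Product), ROLLUP(Year, ProductGroup, Product), ROLLUP(Year, Month, ProductGroup, Product). The only "overlap" is the () in all three.
@David542 right, right, the union distinct could be omitted.
Looks like we are back to your original post - stackoverflow.com/q/69804322/5221944 :o)
I don't think I'm the right person to ask that question! You could raise a feature request with the BigQuery team.
@eshirvana you can upvote this feature request that I added: issuetracker.google.com/issues/204913323.

【Solution 3】:

It gives the right number of rows, but how would I pass in the AVG(Revenue)?

Consider the following - it looks like a simple pattern that can be applied to more potential cases: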

select r.Year, r.Month, r.ProductGroup, r.Product, 
  round(avg(Revenue), 2) avg_Revenue
from (
  select * from (
    select ProductGroup, Product from Sales group by rollup(ProductGroup, Product)
  ), (
    select Year, Month from Sales group by rollup(Year, Month)
  )
) r 
join sales s
on if(r.Year is null, true, r.Year = s.Year) 
and if(r.Month is null, true, r.Month = s.Month) 
and if(r.ProductGroup is null, true, r.ProductGroup = s.ProductGroup)
and if(r.Product is null, true, r.Product = s.Product) 
group by r.Year, r.Month, r.ProductGroup, r.Product

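If applied to the sample data in your script, the output is as follows (top rows only; shown as an image in the original post).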
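
The join condition in this answer treats a NULL in the rollup key as a wildcard. The following Python sketch emulates that idea on the sample data so the numbers can be checked by hand; it is not the answer's SQL, the helper `rollup_keys` is made up, and it assumes no grouping column contains a genuine NULL (a real NULL would be indistinguishable from a rolled-up level).

```python
from itertools import product
from statistics import mean

PRODUCTS = [("Electronics", "Phone"), ("Electronics", "Laptop"),
            ("Cars", "Jeep"), ("Cars", "Hummer")]
REVENUE = {  # (Year, Month) -> revenue per product, in PRODUCTS order
    (2010, "Jan"): (100, 200, 250, 105),
    (2010, "Feb"): (110, 300, 50, 75),
    (2010, "Mar"): (80, 200, 100, 50),
    (2011, "Jan"): (200, 300, 100, 200),
    (2011, "Feb"): (300, 900, 100, 200),
    (2011, "Mar"): (400, 350, 240, 130),
}
sales = [
    {"ProductGroup": pg, "Product": p, "Year": y, "Month": m, "Revenue": rev}
    for (y, m), revs in REVENUE.items()
    for (pg, p), rev in zip(PRODUCTS, revs)
]

def rollup_keys(rows, cols):
    """Distinct keys ROLLUP(cols) would emit, with None for rolled-up columns."""
    return {
        tuple(r[c] if j < i else None for j, c in enumerate(cols))
        for r in rows
        for i in range(len(cols) + 1)
    }

pg_keys = rollup_keys(sales, ("ProductGroup", "Product"))  # 7 keys
ym_keys = rollup_keys(sales, ("Year", "Month"))            # 9 keys

COLS = ("ProductGroup", "Product", "Year", "Month")
result = {}
for key in (a + b for a, b in product(pg_keys, ym_keys)):
    # None matches any value, mirroring if(r.X is null, true, r.X = s.X).
    matched = [s["Revenue"] for s in sales
               if all(k is None or k == s[c] for k, c in zip(key, COLS))]
    result[key] = mean(matched)

print(len(result))                         # 63
print(result[(None, None, None, None)])    # grand average
print(result[("Cars", None, 2010, None)])  # Cars in 2010, all months
```

Because every key is joined back to the raw rows before aggregating, any aggregate could be used in place of the mean, at the cost of matching each source row against several keys.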

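【Comments】:

@mikahil -- cool approach, thanks. Though so far, of the examples I have seen, I think the cleanest (unfortunately) is multiple UNIONs of ROLLUPs.
Sure, agreed, though having more working approaches is always better than having one :o) I usually try to keep posts warm even after an answer has been accepted, to bring in my own experience and approach. Anyway, I think my answer is/was helpful to some extent :o) Agree?

【Solution 4】:

Edit: I misread and thought GROUP BY ROLLUP(A, B), C, D was possible; here is an alternative.

You can implement your own GROUPING SETS logic by cross joining onto a map of which columns you want to roll up...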

SELECT
  CASE WHEN include_pg = 1 THEN ProductGroup END,
  CASE WHEN include_p  = 1 THEN Product      END,
  CASE WHEN include_y  = 1 THEN Year         END,
  CASE WHEN include_m  = 1 THEN Month        END,
  AVG(Revenue)
FROM
  Sales
CROSS JOIN
(
              SELECT 1 AS include_pg, 1 AS include_p
    UNION ALL SELECT 1 AS include_pg, 0 AS include_p
    UNION ALL SELECT 0 AS include_pg, 0 AS include_p
)
  AS rollup_pg_p
CROSS JOIN
(
              SELECT 1 AS include_y, 1 AS include_m
    UNION ALL SELECT 1 AS include_y, 0 AS include_m
    UNION ALL SELECT 0 AS include_y, 0 AS include_m
)
  AS rollup_y_m
GROUP BY
  1, 2, 3, 4
ORDER BY
  1, 2, 3, 4

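【Comments】:

Unfortunately, this does not work in BigQuery and will fail with exactly the same error the OP mentioned in his question.
@Mat -- I included a link to a public query with all the data in the question, but if you want to test it on BQ, here it is: console.cloud.google.com/….
@David542 - if you need those, they should be in your question. SO is not a place for abstract questions; it relies on concrete examples.
@MatBailie agreed. I am just pointing this out about this technique: it can generally be applied to something like AVG, but not to something like MED.
@David542 - new answer, which should work for arbitrary aggregates. In this case it processes 6x the data and hinders the use of indexes and so on, so it may be slightly cleaner but will perform worse.

【Solution 5】:

Similar to my previous answer, but without the CASE expressions "blocking" the use of indexes.

Although it still processes 9x the data, it may be faster than the CASE-based approach.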

WITH
  rollup_pg_p AS
(
            SELECT ProductGroup, Product, Year, Month, Revenue FROM Sales 
  UNION ALL SELECT ProductGroup, NULL,    Year, Month, Revenue FROM Sales 
  UNION ALL SELECT NULL,         NULL,    Year, Month, Revenue FROM Sales 
),
  rollup_y_m AS
(
            SELECT ProductGroup, Product, Year, Month, Revenue FROM rollup_pg_p
  UNION ALL SELECT ProductGroup, Product, Year, NULL,  Revenue FROM rollup_pg_p
  UNION ALL SELECT ProductGroup, Product, NULL, NULL,  Revenue FROM rollup_pg_p
)
SELECT
  ProductGroup, Product, Year, Month, AVG(Revenue) FROM rollup_y_m
GROUP BY
  1, 2, 3, 4
ORDER BY
  1, 2, 3, 4

Edit: elaborating.

Your query is like this (- is my shorthand for NULL in pseudocode)...

          SELECT a, b, c, d, AVG(x) FROM src GROUP BY a, b, c, d
UNION ALL SELECT a, b, c, -, AVG(x) FROM src GROUP BY a, b, c
UNION ALL SELECT a, b, -, -, AVG(x) FROM src GROUP BY a, b

UNION ALL SELECT a, -, c, d, AVG(x) FROM src GROUP BY a, c, d
UNION ALL SELECT a, -, c, -, AVG(x) FROM src GROUP BY a, c
UNION ALL SELECT a, -, -, -, AVG(x) FROM src GROUP BY a

UNION ALL SELECT -, -, c, d, AVG(x) FROM src GROUP BY c, d
UNION ALL SELECT -, -, c, -, AVG(x) FROM src GROUP BY c
UNION ALL SELECT -, -, -, -, AVG(x) FROM src

That is functionally the same as this...

WITH
  combinations AS (
              SELECT a, b, c, d, x FROM src
    UNION ALL SELECT a, b, c, -, x FROM src
    UNION ALL SELECT a, b, -, -, x FROM src
    
    UNION ALL SELECT a, -, c, d, x FROM src
    UNION ALL SELECT a, -, c, -, x FROM src
    UNION ALL SELECT a, -, -, -, x FROM src
    
    UNION ALL SELECT -, -, c, d, x FROM src
    UNION ALL SELECT -, -, c, -, x FROM src
    UNION ALL SELECT -, -, -, -, x FROM src
)
SELECT a, b, c, d, AVG(x) FROM combinations GROUP BY a, b, c, d

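The advantage of the latter is that the aggregate (or aggregates) you want to apply is written only once, and so is the GROUP BY.

It still requires enumerating all 9 combinations.

So, the answer at the top is just a shorthand way of enumerating the 9 combinations. It is not much shorter, but it is slightly shorter. If you needed ROLLUP(a, b), ROLLUP(c, d), ROLLUP(e, f), it would be more valuable (writing 3 combinations for each ROLLUP(), 9 in total, generates 27 combinations).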
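
The "mask the columns first, aggregate once" idea can be emulated in a few lines of Python. This is a sketch of the technique, not the answer's SQL; `mask_rollup` is a made-up helper, and the median is used only to illustrate that the single final aggregation can be any aggregate.

```python
from collections import defaultdict
from statistics import median

PRODUCTS = [("Electronics", "Phone"), ("Electronics", "Laptop"),
            ("Cars", "Jeep"), ("Cars", "Hummer")]
REVENUE = {  # (Year, Month) -> revenue per product, in PRODUCTS order
    (2010, "Jan"): (100, 200, 250, 105),
    (2010, "Feb"): (110, 300, 50, 75),
    (2010, "Mar"): (80, 200, 100, 50),
    (2011, "Jan"): (200, 300, 100, 200),
    (2011, "Feb"): (300, 900, 100, 200),
    (2011, "Mar"): (400, 350, 240, 130),
}
sales = [
    {"ProductGroup": pg, "Product": p, "Year": y, "Month": m, "Revenue": rev}
    for (y, m), revs in REVENUE.items()
    for (pg, p), rev in zip(PRODUCTS, revs)
]
COLS = ("ProductGroup", "Product", "Year", "Month")

def mask_rollup(rows, cols):
    """Emit len(cols) + 1 copies of each row, nulling a growing suffix of cols."""
    return [
        {**r, **{c: None for c in cols[keep:]}}
        for r in rows
        for keep in range(len(cols), -1, -1)
    ]

# Two chained passes, like the two CTEs: 3 copies, then 3 copies of each of those.
expanded = mask_rollup(mask_rollup(sales, ("ProductGroup", "Product")),
                       ("Year", "Month"))

# A single GROUP BY over all four (possibly masked) columns.
groups = defaultdict(list)
for r in expanded:
    groups[tuple(r[c] for c in COLS)].append(r["Revenue"])
medians = {key: median(vals) for key, vals in groups.items()}

print(len(expanded))  # 216 = 24 rows x 9 combinations
print(len(medians))   # 63
```

Each pass multiplies the row count by the number of grouping sets in that rollup, which is where the 9x data volume mentioned in the answer comes from; a third 2-column rollup would make it 27x.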

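【Comments】:

Very interesting approach, can you explain how the two CTEs work?
@David Both of my answers do the same thing as yours, just expressed differently; they are ways of creating the 9 different aggregation combinations you want. This answer creates 3 combinations of the raw data in the first CTE, then creates 3 combinations from that CTE, giving 9 combinations in total over the raw data. Then it aggregates once.

【Solution 6】:

Going back to my original answer of grouping twice...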

WITH
  rollup_pg_p AS
(
  SELECT
    ProductGroup, Product, 1 AS dummy, Year, Month, SUM(Revenue) AS sum_rev, COUNT(Revenue) AS cnt_row
  FROM
    Sales
  GROUP BY
    ROLLUP(Year, Month, ProductGroup, Product)
  HAVING
    Month IS NOT NULL -- This prevents the roll up going further than desired
                      -- Effectively giving `GROUP BY Year, Month, ROLLUP(ProductGroup, Product)`
)
SELECT
  ProductGroup, Product, Year, Month, SUM(sum_rev) / SUM(cnt_row)
FROM
  rollup_pg_p
GROUP BY
  ROLLUP(ProductGroup, Product, Dummy, Year, Month)
HAVING
  dummy IS NOT NULL -- Same 'trick' again, but we created the dummy column
                    -- as ProductGroup and Product CAN legitimately be NULL at this point.
ORDER BY
  1, 2, 3, 4

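(Note: other dialects would use WHERE NOT GROUPING(Product), and so avoid the need for the dummy column, but BigQuery does not seem to have that function either...)

This still has the drawback of not working for some aggregates, but it is probably much faster than the alternatives.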
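
Why the two-pass approach works for AVG but not for every aggregate can be shown with a toy Python example. The two groups below are hypothetical values chosen to have different sizes; they are not rows from the post.

```python
from statistics import mean, median

# Hypothetical first-pass groups of unequal size (illustrative values only).
group_a = [100, 200, 250, 105]
group_b = [110, 300, 50]

# AVG decomposes: carry SUM and COUNT through the first pass, divide at the end.
partials = [(sum(g), len(g)) for g in (group_a, group_b)]
avg_two_pass = sum(s for s, _ in partials) / sum(n for _, n in partials)
avg_direct = mean(group_a + group_b)

# Averaging the per-group averages is not the same thing when sizes differ.
avg_of_avgs = mean([mean(group_a), mean(group_b)])

# MEDIAN has no such decomposition: the first pass discards what is needed.
median_of_medians = median([median(group_a), median(group_b)])
median_direct = median(group_a + group_b)

print(avg_two_pass, avg_direct, avg_of_avgs)
print(median_of_medians, median_direct)
```

This is why the query carries `sum_rev` and `cnt_row` instead of an intermediate AVG, and why an aggregate such as a median would need one of the approaches that aggregates the raw rows directly.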

【Comments】: