【Question title】: How to Pivot in Google BigQuery [duplicate]
【Posted】: 2015-08-23 06:22:24
【Question】:

Suppose I send the following query to BQ:

SELECT shipmentID, category, quantity
FROM [myDataset.myTable]

Suppose further that the query returns data such as:

shipmentID  category  quantity
1           shoes     5
1           hats      3
2           shirts    1
2           hats      2
3           toys      3
2           books     1
3           shirts    1

How can I pivot the results in BQ to produce output like this:

 shipmentID   shoes  hats  shirts  toys  books
 1            5      3     0       0     0
 2            0      2     1       0     1
 3            0      0     1       3     0

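As some additional background, I actually have more than 2,000 categories to pivot, and the data is so large that I can't do it directly with a Pandas DataFrame in Python (it uses all the memory, then slows to a crawl). I tried a relational database but ran into the column limit, so I would like to do this directly in BQ, even if I have to build the query itself through Python. Any suggestions?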

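** Edit 1: I should mention that the pivoting itself can be done in chunks, so that part is not the problem. The real trouble comes from aggregating afterwards so that I end up with only one row per shipmentID. That is what eats all the RAM.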

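** Edit 2: After trying the accepted answer below, I found that using it to create a pivot table with 2k+ columns caused a "Resources exceeded" error. My BQ team was able to refactor the query, breaking it into smaller chunks so that it goes through. The basic structure of the query is as follows: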

SELECT
  SetA.*,
  SetB.*,
  SetC.*
FROM (
  SELECT
    shipmentID,
    SUM(IF (category="Rocks", quantity, 0)),
    SUM(IF (category="Paper", quantity, 0)),
    SUM(IF (category="Scissors", quantity, 0))
  FROM (
    SELECT
      a.shipmentid shipmentid,
      a.quantity quantity,
      a.category category
    FROM
      [myDataset.myTable] a)
  GROUP EACH BY
    shipmentID ) SetA
INNER JOIN EACH (
  SELECT
    shipmentID,
    SUM(IF (category="Jello Molds", quantity, 0)),
    SUM(IF (category="Torque Wrenches", quantity, 0))
  FROM (
    SELECT
      a.shipmentID shipmentID,
      a.quantity quantity,
      a.category category
    FROM
      [myDataset.myTable] a)
  GROUP EACH BY
    shipmentID ) SetB
ON
  SetA.shipmentid = SetB.shipmentid
INNER JOIN EACH (
  SELECT
    shipmentID,
    SUM(IF (category="Deep Thoughts", quantity, 0)),
    SUM(IF (category="Rainbows", quantity, 0)),
    SUM(IF (category="Ponies", quantity, 0))
  FROM (
    SELECT
      a.shipmentid shipmentid,
      a.quantity quantity,
      a.category category
    FROM
      [myDataset.myTable] a)
  GROUP EACH BY
    shipmentID ) SetC
ON
  SetB.shipmentID = SetC.shipmentID

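The pattern above can be continued indefinitely by adding one INNER JOIN EACH segment after another. For my application, BQ was able to handle about 500 columns per chunk.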
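Since the chunked query is too long to write by hand for 2,000+ categories, here is a minimal Python sketch of how it could be generated. It is not the BQ team's actual script: the function names, the `Set0`, `Set1`, ... subquery names, and the input format (a list of `(alias, category)` pairs whose aliases are already valid column names) are all assumptions made for illustration. It also joins every chunk back to `Set0` rather than to the previous chunk, which gives the same rows for an inner join on shipmentID, and it lists the output columns explicitly instead of using `SetA.*`.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_chunked_pivot_query(columns, chunk_size=500, table="[myDataset.myTable]"):
    """Build a legacy-SQL pivot query split into joined chunks.

    columns: list of (alias, category) pairs. The aliases are assumed to be
    valid, unique column names (hypothetical input format).
    """
    if not columns:
        raise ValueError("at least one (alias, category) pair is required")
    select_list = ["  Set0.shipmentID AS shipmentID"]
    sources = []
    for i, chunk in enumerate(chunked(columns, chunk_size)):
        name = f"Set{i}"
        sums = []
        for alias, category in chunk:
            # Assumes backslash escaping is enough for quotes in category names.
            literal = category.replace("'", "\\'")
            sums.append(f"    SUM(IF(category = '{literal}', quantity, 0)) AS {alias}")
            select_list.append(f"  {name}.{alias} AS {alias}")
        subquery = (
            "(\n  SELECT\n    shipmentID,\n"
            + ",\n".join(sums)
            + f"\n  FROM {table}\n  GROUP EACH BY shipmentID ) {name}"
        )
        if i == 0:
            sources.append("FROM " + subquery)
        else:
            sources.append(
                "INNER JOIN EACH " + subquery
                + f"\nON Set0.shipmentID = {name}.shipmentID"
            )
    return "SELECT\n" + ",\n".join(select_list) + "\n" + "\n".join(sources)


if __name__ == "__main__":
    demo = [("Rocks", "Rocks"), ("Paper", "Paper"), ("Jello_Molds", "Jello Molds")]
    print(build_chunked_pivot_query(demo, chunk_size=2))
```

The `chunk_size` default of 500 simply mirrors the figure reported above; the workable value will depend on the data.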

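【Comments】:

  • Do you have to type in all the categories manually? I ran into a similar problem, and that seems like too much work.
  • Have you tried pivoting with a PySpark DataFrame?

Tags: python pandas google-bigquery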


【Solution 1】:

Here is one way to do it:

select shipmentID,
  sum(IF (category='shoes', quantity, 0)) AS shoes,
  sum(IF (category='hats', quantity, 0)) AS hats,
  sum(IF (category='shirts', quantity, 0)) AS shirts,
  sum(IF (category='toys', quantity, 0)) AS toys,
  sum(IF (category='books', quantity, 0)) AS books
from
  (select 1 as shipmentID,           'shoes' as category,    5 as quantity),
  (select 1 as shipmentID,           'hats' as category,      3 as quantity),
  (select 2 as shipmentID,           'shirts' as category,    1 as quantity),
  (select 2 as shipmentID,           'hats' as category,      2 as quantity),
  (select 3 as shipmentID,           'toys' as category,      3 as quantity),
  (select 2 as shipmentID,           'books' as category,     1 as quantity),
  (select 3 as shipmentID,           'shirts' as category,    1 as quantity)
group by shipmentID

This returns:

+-----+------------+-------+------+--------+------+-------+
| Row | shipmentID | shoes | hats | shirts | toys | books |
+-----+------------+-------+------+--------+------+-------+
|   1 |          1 |     5 |    3 |      0 |    0 |     0 |
|   2 |          2 |     0 |    2 |      1 |    0 |     1 |
|   3 |          3 |     0 |    0 |      1 |    3 |     0 |
+-----+------------+-------+------+--------+------+-------+

See the manual for another pivot table example.
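With thousands of categories, the `sum(IF(...))` columns would normally be generated rather than typed. The following is a rough Python sketch of that idea, not part of the original answer: `build_pivot_query` and `column_alias` are hypothetical helper names, the table name is taken from the question, and the rule for turning a category label into a column name (replace anything outside ASCII letters, digits, and underscore) is an assumption.

```python
import re


def column_alias(category, used):
    """Turn a category label into a unique column name (assumed naming rule)."""
    alias = re.sub(r"[^0-9A-Za-z_]", "_", category)
    if not alias or alias[0].isdigit():
        alias = "_" + alias
    base, n = alias, 1
    while alias in used:  # avoid collisions such as "a b" vs "a_b"
        n += 1
        alias = f"{base}_{n}"
    used.add(alias)
    return alias


def build_pivot_query(categories, table="[myDataset.myTable]"):
    """Build one SUM(IF(...)) column per category, grouped by shipmentID."""
    used = set()
    columns = []
    for category in categories:
        # Assumes backslash escaping is enough for quotes in category names.
        literal = category.replace("\\", "\\\\").replace("'", "\\'")
        alias = column_alias(category, used)
        columns.append(f"  SUM(IF(category = '{literal}', quantity, 0)) AS {alias}")
    return (
        "SELECT\n  shipmentID,\n"
        + ",\n".join(columns)
        + f"\nFROM {table}\nGROUP BY shipmentID"
    )


if __name__ == "__main__":
    print(build_pivot_query(["shoes", "hats", "shirts", "toys", "books"]))
```

The list of categories itself could come from a separate query over the distinct values of `category`, so nothing has to be entered by hand.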

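【Comments】:

  • This looks good, and given a list of categories, building the query programmatically should be relatively easy. I'll give it a try. Thanks! :)
  • Tested this, and it works up to a point. Unfortunately, trying to run it on 2,000+ categories produces a "Resources exceeded during execution" error, but it works for a smaller number of categories.
  • @TraxusIV You can contact Google support, or post a new question with the generated query and mention the ID of the failed job; someone on the BQ team will be able to check whether the limit can be raised.
  • Yes, I went ahead and handed it to our BQ team. They will discuss it with Google. To be fair, this is a huge amount of data and processing. I tried doing the pivot on my laptop with 32 GB of RAM, and it just choked.
  • If the usage is reasonable, they can raise the limit. Otherwise, you can reorganize the table.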