【Question Title】: Alternative to self-join
【Posted】: 2019-08-21 08:27:00
【Question】:

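I have a table, supplynetwork, with four columns:

CustomerID, SupplierID, Supplier_productID, Purchase_Year

I want to build customer pairs in which both customers purchased the same product from the same supplier in a given focal year. I do this with a self-join in BigQuery, but it is too slow. Is there an alternative?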

select distinct
  a.CustomerID as focal_CustomerID,
  b.CustomerID as linked_CustomerID,
  a.Purchase_Year,
  a.Supplier_productID
from 
  supplynetwork as a,
  supplynetwork as b
where 
  a.CustomerID<>b.CustomerID and
  a.Purchase_Year=b.Purchase_Year and
  a.Supplier_productID=b.Supplier_productID and
  a.SupplierID=b.SupplierID

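【Comments】:

  • Tip of the day: always use modern, explicit JOIN syntax. It is easier to write (without errors), easier to read (and maintain), and easier to convert to an outer join if needed. (Note that this is not an answer to your performance problem.)
  • You should switch to a.CustomerID < b.CustomerID to avoid duplicates. Right now you get both A, B and B, A.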
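
Taken together, the two comments above suggest the following rewrite of the question's query. This is only a sketch: it assumes unordered pairs are acceptable (each pair appears once, with the smaller CustomerID as the focal customer), and, as the first comment notes, it is not expected to fix the performance problem by itself.

    -- Sketch: the question's self-join with explicit JOIN syntax.
    -- "<" instead of "<>" keeps A, B and drops the mirrored B, A row.
    select distinct
      a.CustomerID as focal_CustomerID,
      b.CustomerID as linked_CustomerID,
      a.Purchase_Year,
      a.Supplier_productID
    from supplynetwork as a
    join supplynetwork as b
      on  a.Purchase_Year = b.Purchase_Year
      and a.Supplier_productID = b.Supplier_productID
      and a.SupplierID = b.SupplierID
    where a.CustomerID < b.CustomerID;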

Tags: sql google-bigquery


【Solution 1】:

Use JOIN syntax and index the CustomerID column:

select distinct
  a.CustomerID as focal_CustomerID,
  b.CustomerID as linked_CustomerID,
  a.Purchase_Year,
  a.Supplier_productID
from 
  supplynetwork as a join
  supplynetwork as b
  on   
  a.Purchase_Year=b.Purchase_Year and
  a.Supplier_productID=b.Supplier_productID and
  a.SupplierID=b.SupplierID
  where a.CustomerID<>b.CustomerID 

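【Comments】:

  • No, the execution time stays the same whichever syntax you use. An index may help if the workload is not OLTP, because maintaining an index slows down frequent inserts and updates.

【Solution 2】: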

You can use aggregation to collect all customers that meet the conditions into a single row:

select Purchase_Year, Supplier_productID, SupplierID,
       array_agg(distinct CustomerID) as customers
from supplynetwork sn
group by Purchase_Year, Supplier_productID, SupplierID;

You can then generate the pairs with array operations:

with pss as (
      select Purchase_Year, Supplier_productID, SupplierID,
             array_agg(distinct CustomerID) as customers
      from supplynetwork sn
      group by Purchase_Year, Supplier_productID, SupplierID
     )
select c1, c2, pss.*
from pss cross join
     unnest(pss.customers) c1 cross join
     unnest(pss.customers) c2
where c1 < c2;
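
To try the query above without touching the real table, one option is to shadow supplynetwork with a few inline rows. The sample rows and customer IDs below are hypothetical and exist only for illustration; the sketch also selects named columns instead of pss.* so that the customers array is not repeated on every output row.

    -- Hypothetical sample rows standing in for the supplynetwork table.
    with supplynetwork as (
      select * from unnest([
        struct('C1' as CustomerID, 'S1' as SupplierID,
               'P1' as Supplier_productID, 2018 as Purchase_Year),
        struct('C2', 'S1', 'P1', 2018),
        struct('C3', 'S1', 'P1', 2018),
        struct('C4', 'S2', 'P9', 2018)
      ])
    ),
    -- One row per (year, product, supplier) with its distinct customers.
    pss as (
      select Purchase_Year, Supplier_productID, SupplierID,
             array_agg(distinct CustomerID) as customers
      from supplynetwork
      group by Purchase_Year, Supplier_productID, SupplierID
    )
    -- Pair customers within each group; "<" keeps one row per pair.
    select c1 as focal_CustomerID, c2 as linked_CustomerID,
           pss.Purchase_Year, pss.Supplier_productID
    from pss cross join
         unnest(pss.customers) c1 cross join
         unnest(pss.customers) c2
    where c1 < c2;

With these rows, only the S1/P1/2018 group has more than one customer, so the result should contain the pairs (C1, C2), (C1, C3) and (C2, C3), while C4 produces no pair.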

【Comments】:

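    【Solution 3】:

    You can use CROSS JOIN, which (even though it is Cartesian) may benefit from its simplicity. Try the query below and see whether it is cheaper than your baseline: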

    select 
       focal_CustomerID, 
       linked_CustomerID, 
       Purchase_Year, 
       Supplier_ProductID 
    from (
      select 
         SupplierID, 
         Supplier_ProductID, 
         Purchase_Year, 
         array_agg(distinct CustomerID) as Customers
      from `mydataset.mytable`
      group by 1,2,3
    ), unnest(Customers) focal_CustomerID
    cross join unnest(Customers) linked_CustomerID
    where focal_CustomerID != linked_CustomerID
    

    【Comments】:
