Performance issue for join on the same tables multiple times: answers

【Question title】: Performance issue for join on the same tables multiple times
【Posted】: 2017-12-06 20:41:21
【Question】:

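I'm facing a performance problem with the following query, in which the same table is self-joined several times. How can I avoid multiple joins on the same table?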

INSERT INTO "TEMP"."TABLE2"
SELECT
T1."PRODUCT_SNO"
,T2."PRODUCT_SNO"
,T3."PRODUCT_SNO"
,T4."PRODUCT_SNO"
,((COUNT(DISTINCT T1."ACCESS_METHOD_ID")(FLOAT)) / 
   (MAX(T5.GROUP_NUM(FLOAT))))
FROM
"TEMP"."TABLE1" T1
,"TEMP"."TABLE1" T2
,"TEMP"."TABLE1" T3
,"TEMP"."TABLE1" T4
,"TEMP"."_TWM_GROUP_COUNT" T5
WHERE
      T1."ACCESS_METHOD_ID" = T2."ACCESS_METHOD_ID"
  AND T2."ACCESS_METHOD_ID" = T3."ACCESS_METHOD_ID"
  AND T3."ACCESS_METHOD_ID" = T4."ACCESS_METHOD_ID"
  AND T1."SUBSCRIPTION_DATE" < T2."SUBSCRIPTION_DATE"
  AND T2."SUBSCRIPTION_DATE" < T3."SUBSCRIPTION_DATE"
  AND T3."SUBSCRIPTION_DATE" < T4."SUBSCRIPTION_DATE"
GROUP BY 1, 2, 3, 4;

It takes 3 hours to complete. Below is its EXPLAIN output:

1) First, we lock a distinct TEMP."pseudo table" for write on a
     RowHash to prevent global deadlock for
     TEMP.TABLE2. 
  2) Next, we lock a distinct TEMP."pseudo table" for read on a
     RowHash to prevent global deadlock for TEMP.T5. 
  3) We lock TEMP.TABLE2 for write, we lock
     TEMP.TABLE1 for access, and we lock TEMP.T5 for read. 
  4) We do an all-AMPs RETRIEVE step from TEMP.T5 by way of an
     all-rows scan with no residual conditions into Spool 4 (all_amps),
     which is duplicated on all AMPs.  The size of Spool 4 is estimated
     with high confidence to be 48 rows (816 bytes).  The estimated
     time for this step is 0.01 seconds. 
  5) We execute the following steps in parallel. 
       1) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of
          an all-rows scan, which is joined to TEMP.T4 by way of an
          all-rows scan with no residual conditions.  Spool 4 and
          TEMP.T4 are joined using a product join, with a join
          condition of ("(1=1)").  The result goes into Spool 5
          (all_amps), which is built locally on the AMPs.  Then we do a
          SORT to order Spool 5 by the hash code of (
          TEMP.T4.ACCESS_METHOD_ID).  The size of Spool 5 is
          estimated with high confidence to be 8,051,801 rows (
          233,502,229 bytes).  The estimated time for this step is 1.77
          seconds. 
       2) We do an all-AMPs JOIN step from TEMP.T2 by way of a
          RowHash match scan with no residual conditions, which is
          joined to TEMP.T1 by way of a RowHash match scan with no
          residual conditions.  TEMP.T2 and TEMP.T1 are joined
          using a merge join, with a join condition of (
          "(TEMP.T1.ACCESS_METHOD_ID = TEMP.T2.ACCESS_METHOD_ID)
          AND (TEMP.T1.SUBSCRIPTION_DATE <
          TEMP.T2.SUBSCRIPTION_DATE)").  The result goes into Spool
          6 (all_amps), which is built locally on the AMPs.  The size
          of Spool 6 is estimated with low confidence to be 36,764,681
          rows (1,213,234,473 bytes).  The estimated time for this step
          is 4.12 seconds. 
  6) We do an all-AMPs JOIN step from Spool 5 (Last Use) by way of a
     RowHash match scan, which is joined to TEMP.T3 by way of a
     RowHash match scan with no residual conditions.  Spool 5 and
     TEMP.T3 are joined using a merge join, with a join condition
     of ("(TEMP.T3.SUBSCRIPTION_DATE < SUBSCRIPTION_DATE) AND
     (TEMP.T3.ACCESS_METHOD_ID = ACCESS_METHOD_ID)").  The result
     goes into Spool 7 (all_amps), which is built locally on the AMPs. 
     The size of Spool 7 is estimated with low confidence to be
     36,764,681 rows (1,360,293,197 bytes).  The estimated time for
     this step is 4.14 seconds. 
  7) We do an all-AMPs JOIN step from Spool 6 (Last Use) by way of a
     RowHash match scan, which is joined to Spool 7 (Last Use) by way
     of a RowHash match scan.  Spool 6 and Spool 7 are joined using a
     merge join, with a join condition of ("(SUBSCRIPTION_DATE <
     SUBSCRIPTION_DATE) AND ((ACCESS_METHOD_ID = ACCESS_METHOD_ID) AND
     ((ACCESS_METHOD_ID = ACCESS_METHOD_ID) AND ((ACCESS_METHOD_ID =
     ACCESS_METHOD_ID) AND (ACCESS_METHOD_ID = ACCESS_METHOD_ID ))))"). 
     The result goes into Spool 3 (all_amps), which is built locally on
     the AMPs.  The result spool file will not be cached in memory. 
     The size of Spool 3 is estimated with low confidence to be
     766,489,720 rows (29,893,099,080 bytes).  The estimated time for
     this step is 1 minute and 21 seconds. 
  8) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
     way of an all-rows scan , grouping by field1 (
     TEMP.T1.PRODUCT_SNO ,TEMP.T2.PRODUCT_SNO
     ,TEMP.T3.PRODUCT_SNO ,TEMP.T4.PRODUCT_SNO
     ,TEMP.T1.ACCESS_METHOD_ID).  Aggregate Intermediate Results
     are computed globally, then placed in Spool 9.  The aggregate
     spool file will not be cached in memory.  The size of Spool 9 is
     estimated with low confidence to be 574,867,290 rows (
     46,564,250,490 bytes).  The estimated time for this step is 6
     minutes and 38 seconds. 
  9) We do an all-AMPs SUM step to aggregate from Spool 9 (Last Use) by
     way of an all-rows scan , grouping by field1 (
     TEMP.T1.PRODUCT_SNO ,TEMP.T2.PRODUCT_SNO
     ,TEMP.T3.PRODUCT_SNO ,TEMP.T4.PRODUCT_SNO).  Aggregate
     Intermediate Results are computed globally, then placed in Spool
     11.  The size of Spool 11 is estimated with low confidence to be
     50,625 rows (3,695,625 bytes).  The estimated time for this step
     is 41.87 seconds. 
 10) We do an all-AMPs RETRIEVE step from Spool 11 (Last Use) by way of
     an all-rows scan into Spool 1 (all_amps), which is redistributed
     by the hash code of (TEMP.T1.PRODUCT_SNO,
     TEMP.T2.PRODUCT_SNO, TEMP.T3.PRODUCT_SNO,
     TEMP.T4.PRODUCT_SNO) to all AMPs.  Then we do a SORT to order
     Spool 1 by row hash.  The size of Spool 1 is estimated with low
     confidence to be 50,625 rows (1,873,125 bytes).  The estimated
     time for this step is 0.04 seconds. 
 11) We do an all-AMPs MERGE into TEMP.TABLE2 from
     Spool 1 (Last Use).  The size is estimated with low confidence to
     be 50,625 rows.  The estimated time for this step is 1 second. 
 12) We spoil the parser's dictionary cache for the table. 
 13) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> No rows are returned to the user as the result of statement 1.

All required statistics have been collected.

【Comments】:

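I'd suggest asking another question that includes sample data, the desired results, and an explanation of what you are trying to do. In the meantime, learn JOIN syntax so your queries can enter the 21st century.
The problem is that you're joining the same table multiple times. You're also joining to T5 with no condition at all, i.e. a Cartesian join. So if T5 has a significant number of rows, it's bound to be slow.
@GordonLinoff My logic requires me to do this kind of join. Is there any other way to avoid it?
@JeffUK There is no matching condition for T5. What options do I have?
What is the DDL of those tables? How many AMPs does your system have? What data is in _TWM_GROUP_COUNT? You're doing some strange things: the COUNT(DISTINCT) will always be 1, and the MAX is probably silly too. Which business problem is this query supposed to solve?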

Tags: sql performance teradata

【Solution 1】:

I have to admit I'm not a Teradata expert, but I did a quick check and you can use ANSI JOIN syntax.

So first I rewrote your query so that I could understand it:

INSERT INTO 
    "TEMP"."TABLE2"
SELECT
    T1."PRODUCT_SNO",
    T2."PRODUCT_SNO",
    T3."PRODUCT_SNO",
    T4."PRODUCT_SNO",
    ((COUNT(DISTINCT T1."ACCESS_METHOD_ID")(FLOAT)) / 
        (MAX(T5.GROUP_NUM(FLOAT))))
FROM
    "TEMP"."TABLE1" T1
    INNER JOIN "TEMP"."TABLE1" T2 ON T2."ACCESS_METHOD_ID" = T1."ACCESS_METHOD_ID" 
        AND T2."SUBSCRIPTION_DATE" > T1."SUBSCRIPTION_DATE"
    INNER JOIN "TEMP"."TABLE1" T3 ON T3."ACCESS_METHOD_ID" = T2."ACCESS_METHOD_ID" 
        AND T3."SUBSCRIPTION_DATE" > T2."SUBSCRIPTION_DATE"
    INNER JOIN "TEMP"."TABLE1" T4 ON T4."ACCESS_METHOD_ID" = T3."ACCESS_METHOD_ID" 
        AND T4."SUBSCRIPTION_DATE" > T3."SUBSCRIPTION_DATE"
    CROSS JOIN "TEMP"."_TWM_GROUP_COUNT" T5
GROUP BY 
    T1."PRODUCT_SNO",
    T2."PRODUCT_SNO",
    T3."PRODUCT_SNO",
    T4."PRODUCT_SNO";

Note that many of these changes are just personal preference, but others will "bring your query into the 21st century" ;P

Now that I can read your SQL, I can make some assumptions about what you're actually trying to achieve here:

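You have a table containing products, each with a serial number, an "access method" (no idea what that is?) and a subscription date;
You're looking for products with the same "access method", chaining them together in subscription-date order, and then showing the serial number of each product in the chain;
Each chain must be exactly 4 products long. I don't know what is supposed to happen when a chain has fewer or more than 4 products (as far as I can see, a chain with fewer than 4 products is simply discarded);
You also have a metric on top of this logic. Right now you are counting the number of distinct access methods per chain and dividing it by some number from another table we know nothing about.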
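To make that reading of the query concrete, here is a minimal sketch on invented toy data. It uses SQLite through Python's sqlite3 module as a stand-in for Teradata, so it only illustrates the join logic, not Teradata's behaviour; the rows and the `chains_per_method` helper are hypothetical.

```python
import sqlite3

# Hypothetical toy data; SQLite stands in for Teradata, so this only
# illustrates the join logic, not Teradata's performance behaviour.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TABLE1 ("
    "ACCESS_METHOD_ID INTEGER, PRODUCT_SNO TEXT, SUBSCRIPTION_DATE TEXT)"
)

dates = ["2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01", "2017-05-01"]
# Access method 1 has 3 products, method 2 has 4, method 3 has 5.
products_by_method = {1: "ABC", 2: "ABCD", 3: "ABCDE"}
rows = [
    (method_id, product, date)
    for method_id, products in products_by_method.items()
    for product, date in zip(products, dates)
]
conn.executemany("INSERT INTO TABLE1 VALUES (?, ?, ?)", rows)

# The same four-way self-join as the rewritten query, without T5 and
# without the aggregation, so the individual chains are visible.
chains = conn.execute("""
    SELECT T1.ACCESS_METHOD_ID,
           T1.PRODUCT_SNO, T2.PRODUCT_SNO, T3.PRODUCT_SNO, T4.PRODUCT_SNO
    FROM TABLE1 T1
    JOIN TABLE1 T2 ON T2.ACCESS_METHOD_ID = T1.ACCESS_METHOD_ID
                  AND T2.SUBSCRIPTION_DATE > T1.SUBSCRIPTION_DATE
    JOIN TABLE1 T3 ON T3.ACCESS_METHOD_ID = T2.ACCESS_METHOD_ID
                  AND T3.SUBSCRIPTION_DATE > T2.SUBSCRIPTION_DATE
    JOIN TABLE1 T4 ON T4.ACCESS_METHOD_ID = T3.ACCESS_METHOD_ID
                  AND T4.SUBSCRIPTION_DATE > T3.SUBSCRIPTION_DATE
    ORDER BY 1, 2, 3, 4, 5
""").fetchall()

# Count how many chains each access method contributes.
chains_per_method = {}
for method_id, *_products in chains:
    chains_per_method[method_id] = chains_per_method.get(method_id, 0) + 1

print(chains_per_method)  # {2: 1, 3: 5}
```

In this toy case the access method with three products contributes nothing, while the one with five products contributes every date-ordered combination of four. That combinatorial growth is consistent with the large row estimate for Spool 3 in the EXPLAIN output.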

That's really not much to go on, but I can see a few places where you could optimize:

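You only use the _TWM_GROUP_COUNT table for one thing, MAX(GROUP_NUM). So you could work that value out before the main query and remove the need for this potentially expensive JOIN. I don't know how to do this in Teradata, but in other SQL dialects you could store it in a variable, use a common table expression, use a subquery, and so on. If that table has many rows, there is a chance the optimizer runs your query x times and then throws away x-1 of the result sets!
Any non-equi join is going to be inefficient, but you don't seem able to avoid those. If your table isn't indexed on SUBSCRIPTION_DATE, it might help to pre-sort the data and add a numeric sequence number (again, in other SQL dialects this would be ROW_NUMBER() OVER (ORDER BY SUBSCRIPTION_DATE)-style syntax). Your date comparisons could then become numeric comparisons;
Obviously, indexes matter here;
Finally, you could split the query into stages: start with the T1-to-T2 join, then use that as the basis for the (T1-to-T2)-to-T3 join, and so on. This may not help at all, but it might be worth a try?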
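As a rough sketch of the first suggestion, the snippet below compares the CROSS JOIN form with a variant that reads MAX(GROUP_NUM) once through a scalar subquery. It runs on invented toy data in SQLite (via Python's sqlite3), not Teradata; the `CAST(... AS REAL)` calls replace the Teradata `(FLOAT)` casts, and the names `SELECT_KEYS` and `CHAIN_JOINS` are hypothetical helpers. Note one assumption: the two forms only agree when _TWM_GROUP_COUNT has at least one row, because an empty table makes the CROSS JOIN return nothing while the subquery yields NULL.

```python
import sqlite3

# Hypothetical toy data; SQLite stands in for Teradata.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TABLE1 ("
    "ACCESS_METHOD_ID INTEGER, PRODUCT_SNO TEXT, SUBSCRIPTION_DATE TEXT)"
)
conn.execute("CREATE TABLE _TWM_GROUP_COUNT (GROUP_NUM INTEGER)")

dates = ["2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01"]
products_by_method = {1: "ABCD", 2: "ABCD", 3: "ABCE"}
conn.executemany(
    "INSERT INTO TABLE1 VALUES (?, ?, ?)",
    [
        (method_id, product, date)
        for method_id, products in products_by_method.items()
        for product, date in zip(products, dates)
    ],
)
conn.executemany(
    "INSERT INTO _TWM_GROUP_COUNT VALUES (?)", [(n,) for n in range(1, 5)]
)

SELECT_KEYS = "T1.PRODUCT_SNO, T2.PRODUCT_SNO, T3.PRODUCT_SNO, T4.PRODUCT_SNO"
CHAIN_JOINS = """
    FROM TABLE1 T1
    JOIN TABLE1 T2 ON T2.ACCESS_METHOD_ID = T1.ACCESS_METHOD_ID
                  AND T2.SUBSCRIPTION_DATE > T1.SUBSCRIPTION_DATE
    JOIN TABLE1 T3 ON T3.ACCESS_METHOD_ID = T2.ACCESS_METHOD_ID
                  AND T3.SUBSCRIPTION_DATE > T2.SUBSCRIPTION_DATE
    JOIN TABLE1 T4 ON T4.ACCESS_METHOD_ID = T3.ACCESS_METHOD_ID
                  AND T4.SUBSCRIPTION_DATE > T3.SUBSCRIPTION_DATE
"""

# Original shape: every chain row is multiplied by every T5 row.
with_cross_join = conn.execute(f"""
    SELECT {SELECT_KEYS},
           CAST(COUNT(DISTINCT T1.ACCESS_METHOD_ID) AS REAL)
               / CAST(MAX(T5.GROUP_NUM) AS REAL)
    {CHAIN_JOINS}
    CROSS JOIN _TWM_GROUP_COUNT T5
    GROUP BY {SELECT_KEYS}
    ORDER BY {SELECT_KEYS}
""").fetchall()

# Variant: MAX(GROUP_NUM) comes from an uncorrelated scalar subquery,
# so _TWM_GROUP_COUNT no longer takes part in the join.
with_subquery = conn.execute(f"""
    SELECT {SELECT_KEYS},
           CAST(COUNT(DISTINCT T1.ACCESS_METHOD_ID) AS REAL)
               / (SELECT CAST(MAX(GROUP_NUM) AS REAL) FROM _TWM_GROUP_COUNT)
    {CHAIN_JOINS}
    GROUP BY {SELECT_KEYS}
    ORDER BY {SELECT_KEYS}
""").fetchall()

print(with_subquery)  # [('A', 'B', 'C', 'D', 0.5), ('A', 'B', 'C', 'E', 0.25)]
print(with_cross_join == with_subquery)  # True
```

Whether the Teradata optimizer treats the subquery form any better than the 48-row product join in the plan above would have to be checked with EXPLAIN on the real system.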

This probably isn't much help, but without some sample data etc. there really isn't enough to go on...
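For the pre-sorting idea mentioned above, one possible sketch follows, again on invented toy data in SQLite (window functions need SQLite 3.25 or later) rather than Teradata. It swaps ROW_NUMBER() for DENSE_RANK() partitioned by ACCESS_METHOD_ID, which is my substitution and not part of the original suggestion: ROW_NUMBER() would give two rows with the same date different numbers, so a strict `<` on the number would admit pairs that the strict date comparison rejects. The table name `TABLE1_SEQ` and column `SEQ_NO` are hypothetical.

```python
import sqlite3

# Hypothetical toy data; SQLite (3.25+ for window functions) stands in
# for Teradata.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TABLE1 ("
    "ACCESS_METHOD_ID INTEGER, PRODUCT_SNO TEXT, SUBSCRIPTION_DATE TEXT)"
)
conn.executemany(
    "INSERT INTO TABLE1 VALUES (?, ?, ?)",
    [
        (1, "A", "2017-01-01"),
        (1, "B", "2017-02-01"),
        (1, "C", "2017-02-01"),  # same date as B: B and C must never chain
        (1, "D", "2017-03-01"),
        (1, "E", "2017-04-01"),
        (1, "F", "2017-05-01"),
    ],
)

# Pre-numbered copy: equal dates receive equal sequence numbers, so a
# strict comparison on SEQ_NO matches a strict comparison on the date.
conn.execute("""
    CREATE TABLE TABLE1_SEQ AS
    SELECT ACCESS_METHOD_ID,
           PRODUCT_SNO,
           DENSE_RANK() OVER (
               PARTITION BY ACCESS_METHOD_ID
               ORDER BY SUBSCRIPTION_DATE
           ) AS SEQ_NO
    FROM TABLE1
""")

CHAIN_SQL = """
    SELECT T1.PRODUCT_SNO, T2.PRODUCT_SNO, T3.PRODUCT_SNO, T4.PRODUCT_SNO
    FROM {table} T1
    JOIN {table} T2 ON T2.ACCESS_METHOD_ID = T1.ACCESS_METHOD_ID
                   AND T2.{col} > T1.{col}
    JOIN {table} T3 ON T3.ACCESS_METHOD_ID = T2.ACCESS_METHOD_ID
                   AND T3.{col} > T2.{col}
    JOIN {table} T4 ON T4.ACCESS_METHOD_ID = T3.ACCESS_METHOD_ID
                   AND T4.{col} > T3.{col}
    ORDER BY 1, 2, 3, 4
"""

by_date = conn.execute(
    CHAIN_SQL.format(table="TABLE1", col="SUBSCRIPTION_DATE")
).fetchall()
by_seq = conn.execute(
    CHAIN_SQL.format(table="TABLE1_SEQ", col="SEQ_NO")
).fetchall()

print(by_date == by_seq)  # True
print(len(by_seq))        # 9
```

This only shows that the numbered table reproduces the same chains. Whether integer comparisons are actually cheaper than date comparisons on Teradata is not something this sketch can show and would need measuring.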
