Teradata aggregation computed locally vs. globally: answers

【Question title】: Teradata Aggregation computation locally vs globally
【Posted】: 2014-01-15 15:17:04
【Question description】:

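I have two tables, Table1 and Table2, both with a primary index (PI) on col1, col2, col3 and col4. I join the two tables and group by a set of columns that includes the tables' PI columns. Can someone tell me why the explain plan says "Aggregate Intermediate Results are computed globally" rather than locally? My understanding is that when the GROUP BY columns contain all of the PI columns, the aggregate results are computed locally rather than globally.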

select
A.col1
,A.col2
,A.col3
,A.col4
,col5
,col6
,col7
,col8
,col9
,SUM(col10)
,COUNT(col11)
from
table1 A
left outer join
table2 B
on A.col1 = B.col1
and A.col2 = B.col2
and A.col3 = B.col3
and A.col4 = B.col4
group by A.col1,A.col2,A.col3,A.col4,col5,col6,col7,col8,col9

Below is the explain plan for the query:

        1) First, we lock a distinct DATEBASE_NAME."pseudo table" for read on a
        RowHash to prevent global deadlock for DATEBASE_NAME.S. 
        2) Next, we lock a distinct DATEBASE_NAME."pseudo table" for write on a
        RowHash to prevent global deadlock for
        DATEBASE_NAME.TARGET_TABLE. 
        3) We lock a distinct DATEBASE_NAME."pseudo table" for read on a RowHash
        to prevent global deadlock for DATEBASE_NAME.E. 
        4) We lock DATEBASE_NAME.S for read, we lock
        DATEBASE_NAME.TARGET_TABLE for write, and we lock
        DATEBASE_NAME.E for read. 
        5) We do an all-AMPs JOIN step from DATEBASE_NAME.S by way of a RowHash
        match scan with no residual conditions, which is joined to
        DATEBASE_NAME.E by way of a RowHash match scan.  DATEBASE_NAME.S and
        DATEBASE_NAME.E are left outer joined using a merge join, with
        condition(s) used for non-matching on left table ("(NOT
        (DATEBASE_NAME.S.col1 IS NULL )) AND ((NOT
        (DATEBASE_NAME.S.col2 IS NULL )) AND ((NOT
        (DATEBASE_NAME.S.col3 IS NULL )) AND (NOT
        (DATEBASE_NAME.S.col4 IS NULL ))))"), with a join condition of (
        "(DATEBASE_NAME.S.col1 = DATEBASE_NAME.E.col1) AND
        ((DATEBASE_NAME.S.col2 = DATEBASE_NAME.E.col2) AND
        ((DATEBASE_NAME.S.col3 = DATEBASE_NAME.E.col3) AND
        (DATEBASE_NAME.S.col4 = DATEBASE_NAME.E.col4 )))").  The input
        table DATEBASE_NAME.S will not be cached in memory.  The result goes
        into Spool 3 (all_amps), which is built locally on the AMPs.  The
        result spool file will not be cached in memory.  The size of Spool
        3 is estimated with low confidence to be 675,301,664 rows (
        812,387,901,792 bytes).  The estimated time for this step is 3
        minutes and 37 seconds. 
        6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
        way of an all-rows scan , grouping by field1 (
        DATEBASE_NAME.S.col1 ,DATEBASE_NAME.S.col2
        ,DATEBASE_NAME.S.col3 ,DATEBASE_NAME.S.col4
        ,DATEBASE_NAME.E.col5
        ,DATEBASE_NAME.S.col6 ,DATEBASE_NAME.S.col7
        ,DATEBASE_NAME.S.col8 ,DATEBASE_NAME.S.col9).  Aggregate
        Intermediate Results are computed globally, then placed in Spool 4. 
        The aggregate spool file will not be cached in memory.  The size
        of Spool 4 is estimated with low confidence to be 506,476,248 rows
        (1,787,354,679,192 bytes).  The estimated time for this step is 1
        hour and 1 minute. 
        7) We do an all-AMPs MERGE into DATEBASE_NAME.TARGET_TABLE
        from Spool 4 (Last Use).  The size is estimated with low
        confidence to be 506,476,248 rows.  The estimated time for this
        step is 33 hours and 12 minutes. 
        8) We spoil the parser's dictionary cache for the table. 
        9) Finally, we send out an END TRANSACTION step to all AMPs involved
        in processing the request.
        -> No rows are returned to the user as the result of statement 1.

【Question discussion】:

Tags: teradata

【Solution 1】:

If you aggregate on col1, col2, col3, col4 only, does it then aggregate locally?

More details at this URL: http://www.teradataforum.com/teradata/20040526_133730.htm

【Discussion】:

But when my PI is a subset of the GROUP BY columns, all the data to be aggregated is already on the same AMP, isn't it?
I'll try it tomorrow.
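The comment above rests on how Teradata places rows: the AMP that stores a row is derived from the hash of its PI column values, so rows that share the same (col1, col2, col3, col4) always sit on the same AMP. A minimal sketch for inspecting that placement, assuming table1 has the primary index (col1, col2, col3, col4) described in the question; the aliases amp_no and row_cnt are illustrative names only:

```sql
-- HASHROW computes the row hash of the PI values, HASHBUCKET maps the hash
-- to a hash bucket, and HASHAMP maps the bucket to the owning AMP.
-- Rows with identical PI values therefore always yield the same amp_no.
SELECT
 HASHAMP(HASHBUCKET(HASHROW(col1, col2, col3, col4))) AS amp_no
,COUNT(*) AS row_cnt
FROM table1
GROUP BY 1
ORDER BY 1;
```

This only shows where base-table rows live. It does not by itself tell you whether the optimizer will treat a spool built from a join as still distributed on those columns.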

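【Solution 2】:

I believe this is because of the intermediate spool. You are grouping on columns from that spool rather than on columns from the original tables. I was able to get Aggregate Intermediate Results computed locally by using a volatile table.

What basically happens in this case is that I took the spool from step 5, gave it a name, and enforced a PI on it. Since the volatile table has the same PI as the initial tables, building the volatile table is also an AMP-local operation.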

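CREATE VOLATILE TABLE x AS
(
SELECT
A.col1
,A.col2
,A.col3
,A.col4
,col5
,col6
,col7
,col8
,col9
-- keep the raw measure columns so they can be aggregated from x afterwards
,col10
,col11
from
table1 A
left outer join
table2 B
on A.col1 = B.col1
and A.col2 = B.col2
and A.col3 = B.col3
and A.col4 = B.col4
--group by A.col1,A.col2,A.col3,A.col4,col5,col6,col7,col8,col9
)
WITH DATA PRIMARY INDEX (col1, col2, col3, col4)
-- without this clause a volatile table's rows are deleted when the creating transaction ends
ON COMMIT PRESERVE ROWS
;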

SELECT
col1
,col2
,col3
,col4
,col5
,col6
,col7
,col8
,col9
,SUM(col10)
,COUNT(col11)
from
x
GROUP BY 
col1,col2,col3,col4,col5,col6,col7,col8,col9
;
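One way to check whether the volatile-table rewrite has the intended effect is to prefix the aggregation with EXPLAIN and read the SUM step of the plan. This is a minimal sketch, assuming the volatile table x exists in the current session and contains col10 and col11; the exact plan wording can vary between Teradata releases.

```sql
-- EXPLAIN returns the optimizer's plan text without executing the query.
EXPLAIN
SELECT
 col1
,col2
,col3
,col4
,col5
,col6
,col7
,col8
,col9
,SUM(col10)
,COUNT(col11)
FROM x
GROUP BY col1, col2, col3, col4, col5, col6, col7, col8, col9;
```

In the SUM step, look for "Aggregate Intermediate Results are computed locally" in place of the "computed globally" wording seen in step 6 of the plan in the question.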

【Discussion】: