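【Posted】: 2017-08-29 01:57:41
【Question】:
I have the following problem.
I have three tables: 1.) an all-users history table that records the users of every instance that has ever run; 2.) an all-instances table that records every instance that has ever run; 3.) an active-instances table from which I can identify the active instances.
My goal is to get all users that belong to an active instance.
The problem is that the all-users table contains 137 billion records, so it cannot simply be joined in a single query.
My best query so far: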
SELECT allcontact.users FROM allcontact
WHERE EXISTS
(
SELECT 1
FROM allinstances
WHERE allinstances.instances = allcontact.instances
AND EXISTS
(
SELECT 1
FROM activeinstances
WHERE 1 = 1
AND activeinstances.end_date> CURRENT_TIMESTAMP
AND activeinstances.run_id = allinstances.run_id
AND activeinstances.run_date = allinstances.run_date))
QUALIFY ROW_NUMBER() OVER (PARTITION BY allcontact.users ORDER BY allcontact.users DESC)=1
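Currently it works with the following logic: it finds all runs whose end_date is later than the current timestamp, then fetches from the allinstances table every instance that belongs to those runs. However, this query ends up running out of spool space.
The reason I need to do it this way is that a run may contain instances that do not exist in the activeinstances table, so I have to take all runs by run_date and run_id and look up their instances in the allinstances table. This query gives me the correct result, but I can only run it when I reduce the size of the result set, which I cannot do in final production.
If I create a volatile table containing all those instances and join it to the allcontact table, I can run it. However, in the final product where this query has to go, I cannot create volatile tables.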
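For reference, the volatile-table workaround described above would look roughly like the following. This is only a sketch, not the exact statements that were run: the table name active_run_instances and the choice of PRIMARY INDEX are assumptions made for illustration, and the column names are the anonymized ones used elsewhere in this question.

```sql
-- Sketch only: stage the instances that belong to active runs.
-- active_run_instances is a hypothetical name.
CREATE VOLATILE TABLE active_run_instances AS
(
    SELECT DISTINCT i.instances
    FROM allinstances AS i
    JOIN activeinstances AS a
      ON a.run_id = i.run_id
     AND a.run_date = i.run_date
    WHERE a.end_date > CURRENT_TIMESTAMP
)
WITH DATA
PRIMARY INDEX (instances)
ON COMMIT PRESERVE ROWS;

-- Join the small staged set to the very large contact table.
SELECT DISTINCT c.users
FROM allcontact AS c
JOIN active_run_instances AS s
  ON s.instances = c.instances;
```

The point of staging is that the set of instances is materialized once and is small compared with allcontact, so the large table is only scanned against a compact, deduplicated list.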
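I would appreciate any suggestions on how to run this as a single query.
The environment is IBM Campaign on Teradata.
Thanks!
Edit: added more details.
Primary keys: allcontact table PK: cntct_id
allinstances table PK: instances
activeinstances table PK: instances
Explain plan: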
SELECT allcontact.users FROM allcontact
WHERE EXISTS
(
SELECT 1
FROM allinstances
WHERE allinstances.instances = allcontact.instances
AND EXISTS
(
SELECT 1
FROM activeinstances
WHERE 1 = 1
AND activeinstances.end_date > CURRENT_TIMESTAMP
AND activeinstances.run_id = allinstances.run_id
AND activeinstances.run_date = allinstances.run_date))
QUALIFY ROW_NUMBER() OVER (PARTITION BY allcontact.users ORDER BY allcontact.users DESC)=1;
This query is optimized using type 2 profile cp_rowkey, profileid 10006.
1) First, we lock ACTIVEINSTANCES for access, we
lock ALLCONTACT in view allcontact for
access, and we lock allinstances for access.
2) Next, we execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from allinstances
by way of an all-rows scan with a condition of (
"allinstances.TRTMNT_TYPE_CODE <> 'I'") into
Spool 3 (all_amps), which is redistributed by the hash code
of (allinstances.RUN_DATE,
allinstances.RUN_ID) to all AMPs. The size of
Spool 3 is estimated with low confidence to be 4,612,364 rows
(119,921,464 bytes). The estimated time for this step is
0.50 seconds.
2) We do an all-AMPs RETRIEVE step from
ACTIVEINSTANCES by way of an all-rows
scan with a condition of (
"(CAST((ACTIVEINSTANCES.END_DATE)
AS TIMESTAMP(6) WITH TIME ZONE))> TIMESTAMP '2017-08-28
01:55:35.110000+00:00'") into Spool 4 (all_amps), which is
redistributed by the hash code of (
ACTIVEINSTANCES.RUN_DATE,
ACTIVEINSTANCES.RUN_ID) to all AMPs.
Then we do a SORT to order Spool 4 by row hash and the sort
key in spool field1 eliminating duplicate rows. The size of
Spool 4 is estimated with no confidence to be 132,623 rows (
4,907,051 bytes). The estimated time for this step is 0.01
seconds.
3) We do an all-AMPs RETRIEVE step from ALLCONTACT
in view allcontact by way of an all-rows scan
with no residual conditions into Spool 5 (all_amps) fanned
out into 17 hash join partitions, which is built locally on
the AMPs. The input table will not be cached in memory, but
it is eligible for synchronized scanning. The size of Spool
5 is estimated with high confidence to be 138,065,479,155
rows (3,451,636,978,875 bytes). The estimated time for this
step is 1 minute and 19 seconds.
3) We do an all-AMPs JOIN step from Spool 3 (Last Use) by way of an
all-rows scan, which is joined to Spool 4 (Last Use) by way of an
all-rows scan. Spool 3 and Spool 4 are joined using a single
partition inclusion hash join, with a join condition of (
"(TRTMNT_TYPE_CODE NOT IN ('I')) AND ((RUN_DATE =
RUN_DATE) AND (RUN_ID = RUN_ID ))"). The result goes into
Spool 7 (all_amps), which is redistributed by the hash code of (
allinstances.INSTANCES) to all AMPs. Then we do a
SORT to order Spool 7 by the sort key in spool field1 eliminating
duplicate rows. The size of Spool 7 is estimated with no
confidence to be 496,670 rows (12,416,750 bytes). The estimated
time for this step is 9.84 seconds.
4) We do an all-AMPs RETRIEVE step from Spool 7 (Last Use) by way of
an all-rows scan into Spool 6 (all_amps) fanned out into 17 hash
join partitions, which is duplicated on all AMPs. The size of
Spool 6 is estimated with no confidence to be 1,862,512,500 rows (
46,562,812,500 bytes).
5) We do an all-AMPs JOIN step from Spool 5 (Last Use) by way of an
all-rows scan, which is joined to Spool 6 (Last Use) by way of an
all-rows scan. Spool 5 and Spool 6 are joined using a inclusion
hash join of 17 partitions, with a join condition of ("INSTANCES =
INSTANCES"). The result goes into Spool 2 (all_amps), which is
built locally on the AMPs. The size of Spool 2 is estimated with
no confidence to be 34,652,542,903 rows (797,008,486,769 bytes).
The estimated time for this step is 23.71 seconds.
6) We do an all-AMPs STAT FUNCTION step from Spool 2 (Last Use) by
way of an all-rows scan into Spool 12 (Last Use), which is
redistributed by hash code to all AMPs. The result rows are put
into Spool 1 (group_amps), which is built locally on the AMPs.
The size is estimated with no confidence to be 650,694,038 rows (
24,075,679,406 bytes).
7) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.
BEGIN RECOMMENDED STATS FOR FINAL PLAN->
-- "COLLECT STATISTICS COLUMN (RUN_ID ,RUN_DATE) ON
ACTIVEINSTANCES" (High Confidence)
-- "COLLECT STATISTICS COLUMN (CAST((END_DATE) AS
TIMESTAMP(6) WITH TIME ZONE)) AS
ACTIVEINSTANCES ON
ACTIVEINSTANCES" (High Confidence)
<- END RECOMMENDED STATS FOR FINAL PLAN
The query that currently works:
SELECT DISTINCT t.users
FROM
(
    SELECT allcontact.users, allcontact.instances
    FROM allcontact
    JOIN
    (
        SELECT DISTINCT Run_dt
        FROM activeinstances
        WHERE activeinstances.end_date > CAST(CURRENT_TIMESTAMP AS TIMESTAMP)
    ) AS drv
    ON drv.Run_dt = allcontact.run_dt
) AS t
JOIN
(
    SELECT DISTINCT allinstances.instances
    FROM allinstances
    JOIN
    (
        SELECT DISTINCT run_date, run_id
        FROM activeinstances
        WHERE activeinstances.end_date > CAST(CURRENT_TIMESTAMP AS TIMESTAMP)
    ) AS activeinstances
    ON activeinstances.run_id = allinstances.run_id
    AND activeinstances.run_date = allinstances.run_date
) AS dt
ON dt.instances = t.instances
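【Comments】:
-
Could you add the DDL with PKs/FKs and the Explain of the current query?
-
Hi @dnoeth, I have added more details.
-
Can you add allinstances.instances = activeinstances.instances to the innermost EXISTS? What is the data type of activeinstances.end_date, DATE or TIMESTAMP? What is the actual row count for activeinstances.end_date > CURRENT_TIMESTAMP compared with the estimated 132,623 rows? By the way, your PKs are probably primary indexes rather than logical primary keys...
-
Hi @dnoeth, you are right about the primary keys, they are actually primary indexes. I do not see any primary keys defined as such. activeinstances.end_date is a TIMESTAMP. The query SELECT COUNT(*) FROM activeinstances WHERE 1 = 1 AND activeinstances.end_date > CURRENT_TIMESTAMP returns only 1054 rows.
-
If I add the condition allinstances.instances = activeinstances.instances, the query no longer picks up instances that belong to the run_id and run_date and exist in allinstances but not in the activeinstances view.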