[Question title]: Snowflake query engine strategy on several WITH query conditions
[Posted]: 2021-10-20 03:11:03
[Question]:

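I'm migrating queries from PySpark to Snowflake, and I'd like to know which of the two options below, A or B, is better.

To avoid unnecessary query conditions, I'd prefer option B if there is no significant performance difference.

With option B, does the Snowflake query engine optimize the query automatically, so that it behaves internally like option A?

Option A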

With A1 AS (select * from a1 where date='2021-10-20'),
A2 AS (select * from a2 where date='2021-10-20'),
A3 AS (select * from a3 where date='2021-10-20'),
A4 AS (select * from a4 where date='2021-10-20'),
A5 AS (select * from a5 where date='2021-10-20')
SELECT *
FROM final_merged_table

And option B

With A1 AS (select * from a1),
A2 AS (select * from a2),
A3 AS (select * from a3),
A4 AS (select * from a4),
A5 AS (select * from a5)
SELECT *
FROM final_merged_table
WHERE date = '2021-10-20'

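[Comments]:

  • Is it safe to assume that each of your CTEs is meant to read from the previous table expression, and that "final_merged_table" is meant to be A5?
  • In the actual code, the CTEs depend on one another in several places; for example, A3 is the result of joining A1 and A2, and A5 is the result of joining A3 and A4. For simplicity, though, you can assume that final_merged_table is the union of all of A1 through A5.

Tags: sql snowflake-cloud-data-platform query-engine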


[Answer 1]:

We can test this. First, let's build a table with one week of dates and 100 million rows:

create or replace table one_week2
as
select '2020-04-01'::date + (7*seq8()/100000000)::int day, random() data, random() data2, random() data3
from table(generator(rowcount => 100000000))
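Before comparing the two queries, it can help to confirm how the generated rows are spread across days. The query below is a small sanity check, not part of the original test; it assumes only the one_week2 table and its day column created above.

```sql
-- Count the generated rows per day, to see how many rows
-- a filter on a single day is expected to keep.
select day, count(*) as row_count
from one_week2
group by day
order by day;
```

The exact counts depend on how the cast to int rounds the generated sequence values, so read them from the output rather than assuming an even split.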

Now we can write two queries that go through this table:

Option 1:

With A1 AS (select * from one_week2 where day='2020-04-05'),
A2 AS (select * from one_week2 where day='2020-04-05'),
A3 AS (select * from one_week2 where day='2020-04-05'),
A4 AS (select * from one_week2 where day='2020-04-05'),
A5 AS (select * from one_week2 where day='2020-04-05'),
final_merged_table as (
    select * from a1 
    union all select * from a2
    union all select * from a3
    union all select * from a4
    union all select * from a5)

SELECT count(*)
FROM final_merged_table

Option 2:

With A1 AS (select * from one_week2),
A2 AS (select * from one_week2),
A3 AS (select * from one_week2),
A4 AS (select * from one_week2),
A5 AS (select * from one_week2),
final_merged_table as (
    select * from a1 
    union all select * from a2
    union all select * from a3
    union all select * from a4
    union all select * from a5)

SELECT count(*)
FROM final_merged_table
where day='2020-04-05'
;

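When we run these queries, the query profiles for both look identical, because the filter has been pushed down:

[Image: Option 1 query profile]

[Image: Option 2 query profile]

Summary

You can trust the Snowflake optimizer.

Trust, but verify: sometimes the optimizer can get confused by complex CTEs. And sometimes Snowflake engineers improve the optimizer, so something that doesn't work today may work better tomorrow.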
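If the profile screenshots are not available, one way to inspect the plan without running the query is Snowflake's EXPLAIN command. The sketch below applies it to a shortened version of Option 2 (two CTEs instead of five, which is a simplification for illustration); the table and column names are the ones from the test above.

```sql
-- Show the logical plan for the "filter at the end" form.
-- In the tabular output, compare partitionsAssigned with partitionsTotal
-- on the TableScan rows and check where the day filter appears.
explain using tabular
with a1 as (select * from one_week2),
     a2 as (select * from one_week2),
     final_merged_table as (
         select * from a1
         union all select * from a2)
select count(*)
from final_merged_table
where day = '2020-04-05';
```

If the filter has been pushed down, the scans should be assigned fewer partitions than the table's total. Running the same EXPLAIN on the Option 1 form gives a plan to compare against.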
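One way to do the "verify" part in SQL is to read the operator statistics of a query you just ran. This is a sketch that assumes your account has the GET_QUERY_OPERATOR_STATS table function available and that the statement is run in the same session, immediately after the query you want to inspect.

```sql
-- Optional: disable the result cache for this session before running the
-- test query, so it is actually executed rather than served from cache.
alter session set use_cached_result = false;

-- Run Option 1 or Option 2 here, then inspect its operators:
select operator_id,
       operator_type,
       operator_statistics:pruning:partitions_scanned as partitions_scanned,
       operator_statistics:pruning:partitions_total   as partitions_total,
       operator_attributes
from table(get_query_operator_stats(last_query_id()))
order by operator_id;
```

If both forms of the query report the same partitions scanned on their TableScan operators, the filter was pushed down in both. If the numbers differ for your real, more complex CTEs, that is a sign to move the filter closer to the source tables by hand.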
