BigQuery View to Update Table: Answers

【Question title】: BigQuery View to Update Table
【Posted】: 2022-01-20 02:16:38
【Question】:

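I have a logging table that holds raw data that needs to be processed, and sometimes a destination table has to be set to avoid resource errors.

Currently I use a BigQuery view to process the data and save the results in another BigQuery table, with a scheduled query set to overwrite that table.

As the data volume grows, this is getting more and more expensive. How can I structure it more efficiently, or following better practice, to save cost?

My current BigQuery view logic looks like this: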

with latest_timestamp as (
select max(timestamp) latest from persist_table
)

select col1, col2, col3 from logging_table where timestamp >= (select latest from latest_timestamp)
union all
select * from persist_table where timestamp < (select latest from latest_timestamp)

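I have to filter on the timestamp, since it is the partitioning column, to avoid duplicate or missing data in the result. I'm not sure whether there is a better way to do this, so I'm open to any suggestions.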

【Comments】:

Tags: sql google-bigquery bigdata

【Solution 1】:

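The following steps should let you insert only the new rows, so you avoid reading and re-inserting the whole table on every run. Keep in mind that BigQuery (under on-demand pricing) charges you by the number of bytes read. Using partitions instead of reading the whole table each time to re-insert it is what saves the cost.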
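To see how many bytes each run actually reads, the job history can be queried. The following is only a sketch: the `region-us` qualifier is an assumption and should be replaced with the region where the datasets live, and the one-day window and row limit are arbitrary.

```sql
-- Bytes processed and billed by recent query jobs in the current project.
-- `region-us` is an assumed location, not something stated in the question.
SELECT
  creation_time,
  job_id,
  total_bytes_processed,
  total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY creation_time DESC
LIMIT 20;
```

Comparing `total_bytes_processed` for the old and the new scheduled query is a direct way to check whether a change really reduced the amount of data read.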

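Make sure all tables (logging_table and persist_table) are partitioned by timestamp, if they aren't already: this greatly reduces the amount of data that has to be read;
Change your scheduled query to the following: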

with latest_timestamp as (
select max(timestamp) latest from persist_table
)

-- rows strictly newer than anything already in persist_table
select col1, col2, col3 from logging_table where timestamp > (select latest from latest_timestamp)
union all
-- rows at the latest persisted timestamp that are not in persist_table yet
(select t1.col1, t1.col2, t1.col3 from
(select col1, col2, col3 from logging_table where timestamp = (select latest from latest_timestamp)) t1
left join
(select * from persist_table where timestamp = (select latest from latest_timestamp)) t2
on
(t1.col1=t2.col1 and t1.col2=t2.col2 and t1.col3=t2.col3)
where
t2.col1 is null)

And change the scheduled query's write preference from Overwrite to Append to table.
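The partitioning requirement can be checked with a metadata query. This is a minimal sketch; `mydataset` is a hypothetical dataset name standing in for wherever logging_table and persist_table live.

```sql
-- Lists the partitioning column of each table, if it has one.
-- mydataset is a placeholder, not a name from the question.
SELECT table_name, column_name, data_type
FROM mydataset.INFORMATION_SCHEMA.COLUMNS
WHERE table_name IN ('logging_table', 'persist_table')
  AND is_partitioning_column = 'YES';
```

A table that does not show up in the result has no partitioning column, and filters on timestamp cannot reduce the bytes read for it.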

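【Discussion】:

Hi @Coastie. Does this answer your question?
It works, but the amount of data processed is still the same as with the original query. I used a left join instead of a right join, so that I get all records from logging_table that have the same timestamp but have not been processed into persist_table yet.
1. You're right, changed to left. 2. If the amount processed is the same, then the partitioning is wrong or is not being used. By definition, a query that uses a partition filter has to read a different amount of data than one that reads the whole table. Please read all the details of the answer: "Make sure all tables are partitioned by timestamp".
I cross-checked all the steps and confirmed that both tables are partitioned on timestamp. I understand the logic here, when we only select data >= the timestamp rather than the original
Anyway, I suggest you keep using this approach, because by appending you lower the cost of ingesting the data.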
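One possible explanation for the unchanged bytes processed, offered as a guess rather than a diagnosis: the BigQuery documentation states that partitions are only pruned when the filter is a constant expression, and a filter that compares against a subquery result, such as `(select latest from latest_timestamp)`, does not count as one. A sketch of a workaround is to resolve the latest timestamp into a script variable first and append with DML. It assumes that the timestamp column has type TIMESTAMP, that the placeholder list col1, col2, col3 includes it (as in the queries above), and that the fallback date for an empty persist_table is acceptable.

```sql
-- Resolve the high-water mark once, so the filters below compare against a
-- variable instead of a subquery. COALESCE covers an empty persist_table;
-- the fallback date is an arbitrary assumption.
DECLARE latest TIMESTAMP DEFAULT COALESCE(
  (SELECT MAX(timestamp) FROM persist_table),
  TIMESTAMP '1970-01-01 00:00:00+00');

INSERT INTO persist_table (col1, col2, col3)
-- rows strictly newer than the high-water mark
SELECT col1, col2, col3
FROM logging_table
WHERE timestamp > latest
UNION ALL
-- rows at the high-water mark that are not in persist_table yet (anti-join)
SELECT t1.col1, t1.col2, t1.col3
FROM (SELECT col1, col2, col3 FROM logging_table WHERE timestamp = latest) t1
LEFT JOIN (SELECT col1, col2, col3 FROM persist_table WHERE timestamp = latest) t2
  ON t1.col1 = t2.col1 AND t1.col2 = t2.col2 AND t1.col3 = t2.col3
WHERE t2.col1 IS NULL;
```

Since the INSERT statement does the appending itself, a scheduled query running this script would normally be saved without a destination table. Whether pruning actually takes effect is best confirmed by comparing the bytes processed before and after the change.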