Insert latest records efficiently in Hive

【Question title】: Insert latest records efficiently in hive
【Posted】: 2022-01-06 14:50:07
【Question】:

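I have around 90 tables in Hive. Each group of 10 is merged with UNION ALL into one of 9 master tables.

New rows are inserted into these 90 base tables every 15 minutes. We need to bring the newly inserted rows into the master tables every 15 minutes.

Checking the IDs with "not in" takes some time.

I also have a timestamp column, and fetching data based on it takes time too.

Is there an efficient way to achieve this: "every 15 minutes, insert the records newly added to the base tables into the master tables"?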

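【Question comments】:

Tags: hive hql

【Solution 1】:

I can think of two options.

Option 1 - You can create a new table that keeps the max datetime stamp for each master/stage combination. The table should look like this: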

masters,stages, mxts
master1,stage1, 2021-01-01 12:30:30
...

Then use that table in your SQL, like below.

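-- staging_table_1 and staging_table_2 are placeholder names; `timestamp` is backquoted because it is a reserved word in Hive
select s.* from staging_table_1 s
join maxtimestamp m on s.`timestamp` > m.mxts and m.stages = 'stage1' and m.masters = 'master1'
union all
select s.* from staging_table_2 s
join maxtimestamp m on s.`timestamp` > m.mxts and m.stages = 'stage2' and m.masters = 'master1'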

Then, after every load, write the new max timestamp for each master/stage combination into the new table.
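One way the watermark table and its refresh could look is sketched below. This is only an illustration under assumptions: the table names `staging_table_1` and `staging_table_2` are placeholders, the `timestamp` column is assumed to be of type TIMESTAMP, and the refresh uses INSERT OVERWRITE so that the table keeps exactly one row per master/stage combination (appending rows instead would make the join above return duplicates).

```sql
-- Hypothetical DDL for the watermark table from Option 1.
CREATE TABLE IF NOT EXISTS maxtimestamp (
  masters STRING,
  stages  STRING,
  mxts    TIMESTAMP
);

-- Run after the load: replace the stored watermarks with the newest
-- timestamp currently present in each staging table.
INSERT OVERWRITE TABLE maxtimestamp
SELECT u.masters, u.stages, u.mxts
FROM (
  SELECT 'master1' AS masters, 'stage1' AS stages, MAX(s.`timestamp`) AS mxts
  FROM staging_table_1 s
  UNION ALL
  SELECT 'master1' AS masters, 'stage2' AS stages, MAX(s.`timestamp`) AS mxts
  FROM staging_table_2 s
) u;
```

If rows can arrive in a staging table between the load and this refresh, they could fall below the new watermark without ever being loaded. Computing the upper bound before the load and filtering the load on it is one possible way around that.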

Option 2 - If you can add a new column called record_created_by to the master table to track which stage created each row, your insert statement would look like this:

-- each branch picks up only the rows newer than what that stage has already written to master
select s.*, 'master1~stage1' as record_created_by from staging_table_1 s
join (select max(`timestamp`) mxts from master where record_created_by = 'master1~stage1') mx on s.`timestamp` > mx.mxts
union all
select s.*, 'master1~stage2' as record_created_by from staging_table_2 s
join (select max(`timestamp`) mxts from master where record_created_by = 'master1~stage2') mx on s.`timestamp` > mx.mxts

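Please note that your very first insert statement is the same as the SQL above but without the timestamp part: the master table has no rows for that stage yet, so max(timestamp) is NULL and the comparison would match nothing. If you have more stages, you can add them in the same way as in this SQL.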
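As a possible alternative to a separate first-run statement, the watermark can be given a default value so the same query also covers the initial load. The sketch below is an assumption, not part of the original answer: it uses `staging_table_1` as a placeholder name, assumes the `timestamp` column is of type TIMESTAMP, assumes the columns of `master` are the staging columns followed by `record_created_by`, and picks 1970-01-01 as an arbitrary lower bound that is older than any real row.

```sql
INSERT INTO TABLE master
SELECT s.*, 'master1~stage1' AS record_created_by
FROM staging_table_1 s
JOIN (
  -- MAX over zero rows yields a single NULL row; COALESCE turns it into
  -- the assumed lower bound so the first load selects every staging row.
  SELECT COALESCE(MAX(m.`timestamp`),
                  CAST('1970-01-01 00:00:00' AS TIMESTAMP)) AS mxts
  FROM master m
  WHERE m.record_created_by = 'master1~stage1'
) mx ON s.`timestamp` > mx.mxts;
```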

The first option is faster, but you need to create and maintain a new table.

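【Comments】:

Yes, I was also considering the first option, and I have implemented it. With the current data volume, "not in" is slightly faster than the intermediate table with a join. The data is currently in the thousands of rows.
Sounds good. I'm glad my answer helped :)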