Re-writing a join query: answers

【Question title】: Re-writing a join query
【Posted】: 2017-05-26 09:13:22
【Question description】:

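I have a question about Hive. Let me explain the scenario:

I am using a Hive action in Oozie; I have a query that performs successive LEFT JOINs on different tables;
The total number of rows to be inserted is about 35 million;
At first the job crashed from lack of memory, so I set "set hive.auto.convert.join=false"; the query then ran perfectly but took 4 hours to complete;
I tried reordering the LEFT JOINs so that the large tables come last, but the result was the same: about 4 hours to execute;

The query looks like this: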

INSERT OVERWRITE TABLE final_table
SELECT 
T1.Id,
T1.some_field_name,
T1.another_filed_name,

T2.also_another_filed_name,

FROM table1 T1
LEFT JOIN table2 T2 ON ( T2.Id = T1.Id ) -- T2 is the smallest table
LEFT JOIN table3 T3 ON ( T3.Id = T1.Id )
LEFT JOIN table4 T4 ON ( T4.Id = T1.Id ) -- T4 is the biggest table

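So, knowing the structure of the query, is there a way to rewrite it so that I can avoid so many JOINs?

Thanks in advance

PS: Even enabling vectorization gave me the same timing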

【Question comments】:

Tags: hadoop hive left-join query-optimization

【Solution 1】:

Too long for a comment; will be deleted later.

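(1) Your current query does not compile.
(2) You are not selecting anything from T3 and T4, which makes joining them pointless.
(3) Changing the order of the tables is unlikely to have any effect with a cost-based optimizer.
(4) Basically I would suggest collecting statistics on the tables, specifically on the id columns, but in your case I have a feeling that id is not unique in more than one of the tables.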
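For point (4), statistics collection in Hive could look roughly like the sketch below. It assumes the four tables from the question are unpartitioned (a partitioned table would need a PARTITION clause) and that the join key column is named id; the SET lines are optional settings that let the optimizer use the collected statistics, and their defaults vary by Hive version.

```sql
-- Table-level statistics (row counts, data size) for each table in the join.
ANALYZE TABLE table1 COMPUTE STATISTICS;
ANALYZE TABLE table2 COMPUTE STATISTICS;
ANALYZE TABLE table3 COMPUTE STATISTICS;
ANALYZE TABLE table4 COMPUTE STATISTICS;

-- Column-level statistics on the join key (assumed to be named id).
ANALYZE TABLE table1 COMPUTE STATISTICS FOR COLUMNS id;
ANALYZE TABLE table2 COMPUTE STATISTICS FOR COLUMNS id;
ANALYZE TABLE table3 COMPUTE STATISTICS FOR COLUMNS id;
ANALYZE TABLE table4 COMPUTE STATISTICS FOR COLUMNS id;

-- Let the cost-based optimizer read those statistics when planning.
SET hive.cbo.enable=true;
SET hive.stats.fetch.column.stats=true;
SET hive.compute.query.using.stats=true;
```

Statistics only help the planner choose a join strategy; they do not reduce the output size if id values repeat across tables.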

Add the result of the following query to your post:

select      *
           ,    case when cnt_1 = 0 then 1 else cnt_1 end
            *   case when cnt_2 = 0 then 1 else cnt_2 end
            *   case when cnt_3 = 0 then 1 else cnt_3 end
            *   case when cnt_4 = 0 then 1 else cnt_4 end   as product


from       (select      id
                       ,count(case when tab = 1 then 1 end) as cnt_1
                       ,count(case when tab = 2 then 1 end) as cnt_2
                       ,count(case when tab = 3 then 1 end) as cnt_3
                       ,count(case when tab = 4 then 1 end) as cnt_4

            from       (            select 1 as tab,id from table1
                        union all   select 2 as tab,id from table2  
                        union all   select 3 as tab,id from table3
                        union all   select 4 as tab,id from table4 
                        ) t

            group by    id

            having      greatest (cnt_1,cnt_2,cnt_3,cnt_4) >= 10
            ) t 

order by    product desc

limit       10
;
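The product column above matters because successive joins on a non-unique key multiply rows. A minimal sketch with made-up inline data (the CTE names t1, t2, t3 are hypothetical, not the tables from the question) shows the effect: one id that appears once, twice, and three times respectively.

```sql
-- Hypothetical toy data: the same id occurs 1, 2 and 3 times.
WITH t1 AS (SELECT 1 AS id),
     t2 AS (SELECT 1 AS id UNION ALL SELECT 1 AS id),
     t3 AS (SELECT 1 AS id UNION ALL SELECT 1 AS id UNION ALL SELECT 1 AS id)
SELECT count(*) AS joined_rows   -- 1 * 2 * 3 = 6 rows for a single id
FROM t1
LEFT JOIN t2 ON (t2.id = t1.id)
LEFT JOIN t3 ON (t3.id = t1.id);
```

If the diagnostic query reports ids with large products, the 4-hour runtime is probably dominated by this row multiplication rather than by the join order.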

【Discussion】: