Hive insert overwrite truncates the table in a few cases

【Question title】: Hive insert overwrite truncates the table in a few cases
【Posted】: 2018-05-31 19:07:54
【Question】:

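I am working on a solution and found that in some specific cases Hive insert overwrite truncates the table, while in a few other cases it does not. Can someone explain this behavior?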

To illustrate, I have two tables, source and target, and I try to insert data from the source table into the target table using insert overwrite.

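When the source table is partitioned

If the source table is partitioned and the where clause filters on a partition that does not exist, the target table is not truncated.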

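create table source (name String) partitioned by (age int);
-- dynamic-partition insert; in strict mode, set hive.exec.dynamic.partition.mode=nonstrict first
insert into source partition (age) values("gaurang", 11);
create table target (name String, age int);
-- target is not partitioned, so the insert takes no partition clause
insert into target values("xxx", 99);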

The following query does not truncate the table, even though the select returns nothing.

insert overwrite table temp.target select * from temp.source where name="Ddddd" and age=99;

However, the following query does truncate the table.

insert overwrite table temp.target select * from temp.source where name="Ddddd" and age=11;

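The first case makes sense: partition (age=99) does not exist, so the query should not execute any further. But that is only my assumption; I am not sure what actually happens.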
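One way to check this assumption is to list the partitions of the source table and look at the query plan. This is only a sketch: it assumes the partitioned source table from the example lives in the temp database, and the plan output differs between Hive versions, so it is not shown here.

```sql
-- List the partitions that actually exist in the source table;
-- with the sample data only age=11 is expected to be listed.
show partitions temp.source;

-- Inspect the plan for the query that filters on the missing partition age=99
-- and compare it with the plan for age=11.
explain
insert overwrite table temp.target
select * from temp.source where name="Ddddd" and age=99;
```

If the plan for age=99 reads no partition at all while the plan for age=11 scans one, that would be consistent with the assumption above.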

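When the source table is not partitioned but the target is

In this case the target table is not truncated, even if the select statement on the source table returns 0 rows.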

use temp;
drop table if exists source1;
drop table if exists target1;
create table source1 (name String, age int);
create table target1 (name String) partitioned by (age int);
insert into source1 values ("gaurang", 11);
insert into target1 partition(age) values("xxx", 99);
select  * from source1;
select * from target1;

The following query does not truncate the table, even though the select statement finds no data.

insert overwrite table temp.target1 partition(age) select * from temp.source1 where age=90;

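When neither the source nor the target is partitioned

In this case, if I run insert overwrite on the target and the select statement returns no rows, the target table is truncated. See the example below.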

use temp;
drop table if exists source1;
drop table if exists target1;
create table source1 (name String, age int);
create table target1 (name String, age int);
insert into source1 values ("gaurang", 11);
insert into target1 values("xxx", 99);
select  * from source1;
select * from target1;

The following query truncates the target table.

insert overwrite table temp.target1 select * from temp.source1 where age=90;

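【Comments】:

Can you check whether both tables are external or internal?
@user3508766 This has nothing to do with external/managed. The only difference between managed and external tables in Hive is the drop table behavior.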

Tags: hadoop hive hiveql hadoop-partitioning

【Solution 1】:

It is better to use the term 'overwrite' rather than 'truncate', because that is exactly what happens during insert overwrite.

When you write overwrite table temp.target1 partition(age), you instruct Hive to overwrite partitions, not the whole target1 table, and only those partitions that the select returns.

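An empty dataset does not overwrite any partition in dynamic partition mode. The partitions to overwrite are not known in advance; they are taken from the dataset, and since the dataset is empty, there is nothing to overwrite.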
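The difference can be seen with the tables from the question. A minimal sketch, assuming the partitioned target1 (partition age=99 holding "xxx") and the unpartitioned source1 (one row, "gaurang", 11) exist in the temp database, and assuming the session has to enable non-strict dynamic partitioning explicitly:

```sql
-- Assumption: these settings are needed when the session runs in strict mode.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The select returns one row with age=11, so only partition age=11 is
-- written; partition age=99 is left untouched.
insert overwrite table temp.target1 partition(age)
select name, age from temp.source1 where age=11;

-- The select returns no rows, so no partition value is produced
-- and no existing partition is overwritten.
insert overwrite table temp.target1 partition(age)
select name, age from temp.source1 where age=90;

-- List the partitions that exist afterwards.
show partitions temp.target1;
```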

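In the case of an unpartitioned table, it is already known that the whole table should be overwritten, regardless of whether the dataset is empty.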
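If the goal is to empty a partition or a table no matter what a select returns, one option that is not part of the original answer is an explicit truncate. A sketch, assuming target1 is a managed, partitioned table in the temp database; whether an external table can be truncated depends on the Hive version and table properties:

```sql
-- Remove all rows from a single partition of the partitioned table.
truncate table temp.target1 partition (age=99);

-- Remove all rows from the whole table.
truncate table temp.target1;
```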

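The partition columns should come last in the select list of the insert overwrite statement. The list of partitions to overwrite in the target equals the list of values returned in the partition column by the dataset, regardless of how the source table is partitioned (you can select the target partition column from any source column, compute it, or use a constant); only what is returned matters.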
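To illustrate the last point, here is a sketch using the question's tables, again assuming the partitioned target1 and the unpartitioned source1 (one row, "gaurang", 11) in the temp database and a session where non-strict dynamic partitioning may need to be enabled. The expression age + 1 and the constant 99 are arbitrary values chosen for the example.

```sql
-- Assumption: needed only when the session runs in strict mode.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Partition value computed from a source column: the single source row
-- yields age=12, so only partition age=12 is written.
insert overwrite table temp.target1 partition(age)
select name, age + 1 as age from temp.source1;

-- Partition value given as a constant: every returned row goes to
-- partition age=99, whose previous content is replaced.
insert overwrite table temp.target1 partition(age)
select name, 99 as age from temp.source1;
```

In both statements the partition column is the last column of the select list, and the target partitions are determined only by the values it returns.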
