hive - partitioning table

【Question title】: hive - partitioning table
【Posted】: 2016-08-20 18:22:15
【Question description】:

I created a Hive table with this query:

create table studpart4(id int, name string) partitioned by (course string, year int) row format delimited fields terminated by '\t' lines terminated by '\n' stored as textfile;

The table was created successfully.

I then loaded data with the following command:

load data local inpath '/scratch/hive_inputs/student_input_1.txt' overwrite into table studpart4 partition(course='cse',year=2);

My input data file looks like this:

 101    student1    cse 1

 102    student2    cse 2

 103    student3    eee 3

 104    student4    eee 4

 105    student5    cse 1

 106    student6    cse 2

 107    student7    eee 3

 108    student8    eee 4

 109    student9    cse 1

 110    student10   cse 2

But the output of select * from studpart4 is:

 101    student1    cse 2

 102    student2    cse 2

 103    student3    eee 2

 104    student4    eee 2

 105    student5    cse 2

 106    student6    cse 2

 107    student7    eee 2

 108    student8    eee 2

 109    student9    cse 2

 110    student10   cse 2

Why is the last column 2 in every row? Why was the data changed and stored incorrectly?

【Question discussion】:

stackoverflow.com/a/13224581/2079249

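Tags: hadoop hive

【Solution 1】:

The result you are seeing is exactly what you told Hive to do with your data.

In your first command you create a partitioned table studpart4 with two columns, id and name, and two partition keys, course and year (which, once created, behave like regular columns). Then, in your second command, you run: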

load data local inpath '/scratch/hive_inputs/student_input_1.txt' overwrite into table studpart4 partition(course='cse',year=2)

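This basically means "copy all the data from student_input_1.txt into table studpart4, and set every value of column course to 'cse' and every value of column year to 2". The course and year fields inside the file are not consulted, because the table itself declares only id and name as data columns. Internally, Hive creates a directory structure that contains your partition keys. Your data will be stored in a directory like this: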

.../studpart4/course=cse/year=2/
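
One way to confirm this is to ask Hive which partitions the table has. The following is a minimal sketch, assuming only the single LOAD DATA command from the question has been run against studpart4:

```sql
-- List the partitions Hive has registered for the table.
-- After the single static load above, the only entry should be course=cse/year=2.
SHOW PARTITIONS studpart4;

-- The partition columns are filled in from the partition spec, not from the file,
-- so every row reports the same (course, year) pair.
SELECT DISTINCT course, year FROM studpart4;
```

Every row of the file lands in that one partition, which is why year is 2 throughout.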

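I suspect what you actually want is for Hive to detect the course and year values in your .txt file and set the correct values for you. To do that, you have to use dynamic partitioning for the table: first load your data into an external table, then use an INSERT OVERWRITE TABLE command to store the data into your studpart4 table. The link BigDataLearner posted in the comments describes this strategy.

I hope this helps.

【Discussion】:

Excellent. Thanks for the detailed explanation. It is clear to me now.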
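
The strategy could look roughly like the sketch below. The staging table name studpart4_staging is hypothetical, and for brevity it is a plain managed table rather than an external one; an external table would also need a LOCATION clause pointing at wherever your file lives. The two SET properties are the standard Hive switches for dynamic partitioning.

```sql
-- Staging table (hypothetical name): all four fields are ordinary columns,
-- so course and year are read from the file instead of from a partition spec.
CREATE TABLE studpart4_staging (id INT, name STRING, course STRING, year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/scratch/hive_inputs/student_input_1.txt'
OVERWRITE INTO TABLE studpart4_staging;

-- Enable dynamic partitioning; nonstrict mode is needed because
-- no partition key is given a static value in the INSERT below.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The partition columns must come last in the SELECT list,
-- in the same order as in the PARTITION clause.
INSERT OVERWRITE TABLE studpart4 PARTITION (course, year)
SELECT id, name, course, year
FROM studpart4_staging;
```

With this approach Hive creates one directory per distinct (course, year) pair found in the data, rather than placing every row under course=cse/year=2.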