Data Import

Loading Data into a Table (Load)

Syntax

load data [local] inpath '/opt/module/datas/student.txt'
[overwrite] into table student [partition (partcol1=val1, …)];
Parameter     Description
load data     Load data
local         Load from the local filesystem into the Hive table; without it, load from HDFS
inpath        Path of the data to load
overwrite     Overwrite the table's existing data; without it, data is appended
into table    The table to load into
student       The specific table name
partition     Load into the specified partition
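
Putting the clauses together, a hedged sketch of a load that overwrites a single partition (the table student_p and its month partition column are assumed for illustration; the student table used below is unpartitioned):

```sql
-- Hypothetical: overwrite one partition of a partitioned table
load data local inpath '/opt/module/datas/student.txt'
overwrite into table student_p partition (month='201905');
```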

Hands-on Examples

Create a table
create table student(id string, name string) row format delimited fields terminated by '\t';
Load a local file into Hive
load data local inpath '/opt/module/datas/student.txt' into table student;
hive (default)> load data local inpath '/opt/module/datas/student.txt' into table student;
Loading data to table default.student
Table default.student stats: [numFiles=1, totalSize=27]
OK
Time taken: 1.144 seconds
hive (default)> select * from student;
OK
student.id      student.name
1       zhangsan
2       lisi
3       wangwu
Time taken: 0.085 seconds, Fetched: 3 row(s)
Load an HDFS file into Hive

Upload the file to HDFS

dfs -put /opt/module/datas/student.txt /user/hive/warehouse;

Load the HDFS data

load data inpath '/user/hive/warehouse/student.txt' into table student;

Load data, overwriting the table's existing data

load data inpath '/user/hive/warehouse/student.txt' overwrite into table student;
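
Note that loading from an HDFS path moves the source file into the table's directory rather than copying it. This can be confirmed from the Hive shell (paths taken from the example above):

```sql
dfs -ls /user/hive/warehouse/;          -- student.txt is no longer here
dfs -ls /user/hive/warehouse/student/;  -- it has been moved under the table directory
```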

Hive DML Data Operations

Inserting Data into a Table via Query (Insert)

Create a partitioned table

create table student2(id int, name string) partitioned by (month string) row format delimited fields terminated by '\t';

Insert data

Basic insert
insert into table student2 partition(month='201905') values(1,'wangwu');

An insert into statement launches a MapReduce job. Note in the log below that the first attempt fails because student is not a partitioned table; the insert then succeeds against the partitioned student2.

hive (default)> insert into table  student partition(month='201905') values(1,'wangwu');
FAILED: SemanticException table is not partitioned but partition spec exists: {month=201905}
hive (default)> insert into table  student2 partition(month='201905') values(1,'wangwu');
Query ID = root_20190502135403_30a43b47-ec6a-4bbf-b407-50be007327d8
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1554120237694_0003, Tracking URL = http://hadoop101:8088/proxy/application_1554120237694_0003/
Kill Command = /opt/module/hadoop-2.7.2/bin/hadoop job  -kill job_1554120237694_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-05-02 13:54:30,454 Stage-1 map = 0%,  reduce = 0%
2019-05-02 13:54:39,718 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.4 sec
MapReduce Total cumulative CPU time: 2 seconds 400 msec
Ended Job = job_1554120237694_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop101:9000/user/hive/warehouse/student2/month=201905/.hive-staging_hive_2019-05-02_13-54-03_419_1815339772519319942-1/-ext-10000
Loading data to table default.student2 partition (month=201905)
Partition default.student2{month=201905} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 2.4 sec   HDFS Read: 3652 HDFS Write: 94 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 400 msec
OK
_col0   _col1
Time taken: 41.99 seconds
hive (default)> select * from student2;
OK
student2.id     student2.name   student2.month
1       wangwu  201905
Time taken: 0.973 seconds, Fetched: 1 row(s)
Basic insert from a query (single-table query result)
insert overwrite table student2 partition(month='201904') 
select id, name from student2 where month='201905';
hive (default)> insert overwrite table student2 partition(month='201904') select id, name from student2 where month='201905';
Query ID = root_20190502135725_8518579c-6415-4312-ac66-85d523b85a74
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1554120237694_0004, Tracking URL = http://hadoop101:8088/proxy/application_1554120237694_0004/
Kill Command = /opt/module/hadoop-2.7.2/bin/hadoop job  -kill job_1554120237694_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-05-02 13:57:37,006 Stage-1 map = 0%,  reduce = 0%
2019-05-02 13:57:46,786 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.57 sec
MapReduce Total cumulative CPU time: 1 seconds 570 msec
Ended Job = job_1554120237694_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop101:9000/user/hive/warehouse/student2/month=201904/.hive-staging_hive_2019-05-02_13-57-25_946_1497692349000822439-1/-ext-10000
Loading data to table default.student2 partition (month=201904)
Partition default.student2{month=201904} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.57 sec   HDFS Read: 3606 HDFS Write: 94 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 570 msec
OK
id      name
Time taken: 23.601 seconds
hive (default)> select * from student2;
OK
student2.id     student2.name   student2.month
1       wangwu  201904
1       wangwu  201905
Time taken: 0.162 seconds, Fetched: 2 row(s)
Multi-insert mode (one from clause feeding multiple inserts)
from student2 
insert overwrite table student2 partition(month='201901')
select id, name where month='201904'
insert overwrite table student2 partition(month='201902')
select id, name where month='201905';
hive (default)> from student2 
              > insert overwrite table student2 partition(month='201901')
              > select id, name where month='201904'
              > insert overwrite table student2 partition(month='201902')
              > select id, name where month='201905';
Query ID = root_20190502140047_d0fc48db-4e42-43c1-a7ea-3fcb22acd77d
Total jobs = 5
Launching Job 1 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1554120237694_0005, Tracking URL = http://hadoop101:8088/proxy/application_1554120237694_0005/
Kill Command = /opt/module/hadoop-2.7.2/bin/hadoop job  -kill job_1554120237694_0005
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 0
2019-05-02 14:00:56,920 Stage-2 map = 0%,  reduce = 0%
2019-05-02 14:01:05,735 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 1.97 sec
MapReduce Total cumulative CPU time: 1 seconds 970 msec
Ended Job = job_1554120237694_0005
Stage-5 is selected by condition resolver.
Stage-4 is filtered out by condition resolver.
Stage-6 is filtered out by condition resolver.
Stage-11 is selected by condition resolver.
Stage-10 is filtered out by condition resolver.
Stage-12 is filtered out by condition resolver.
Moving data to: hdfs://hadoop101:9000/user/hive/warehouse/student2/month=201901/.hive-staging_hive_2019-05-02_14-00-47_034_306334479514022067-1/-ext-10000
Moving data to: hdfs://hadoop101:9000/user/hive/warehouse/student2/month=201902/.hive-staging_hive_2019-05-02_14-00-47_034_306334479514022067-1/-ext-10002
Loading data to table default.student2 partition (month=201901)
Loading data to table default.student2 partition (month=201902)
Partition default.student2{month=201901} stats: [numFiles=1, numRows=0, totalSize=9, rawDataSize=0]
Partition default.student2{month=201902} stats: [numFiles=1, numRows=0, totalSize=9, rawDataSize=0]
MapReduce Jobs Launched: 
Stage-Stage-2: Map: 1   Cumulative CPU: 1.97 sec   HDFS Read: 5460 HDFS Write: 188 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 970 msec
OK
id      name
Time taken: 21.918 seconds
hive (default)> select * from student2;
OK
student2.id     student2.name   student2.month
1       wangwu  201901
1       wangwu  201902
1       wangwu  201904
1       wangwu  201905
Time taken: 0.31 seconds, Fetched: 4 row(s)
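
Beyond the static partitions used above, Hive also supports dynamic partition inserts, where each row's target partition comes from the query itself. A minimal sketch (the two set statements are required; nonstrict mode allows all partition columns to be dynamic):

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- The last select column (month) determines each row's target partition
insert overwrite table student2 partition(month)
select id, name, month from student2;
```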

Creating a Table from a Query and Loading Data (As Select)

Create a table from a query result; the rows returned by the query populate the new table.

create table if not exists student3 as select id, name from student;
hive (default)> create table if not exists student3
              > as select id, name from student;
Query ID = root_20190502140249_4f88537d-a1e3-4181-b659-42ba313491c8
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1554120237694_0006, Tracking URL = http://hadoop101:8088/proxy/application_1554120237694_0006/
Kill Command = /opt/module/hadoop-2.7.2/bin/hadoop job  -kill job_1554120237694_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-05-02 14:03:01,836 Stage-1 map = 0%,  reduce = 0%
2019-05-02 14:03:10,732 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.2 sec
MapReduce Total cumulative CPU time: 1 seconds 200 msec
Ended Job = job_1554120237694_0006
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop101:9000/user/hive/warehouse/.hive-staging_hive_2019-05-02_14-02-49_692_5998422552783035794-1/-ext-10001
Moving data to: hdfs://hadoop101:9000/user/hive/warehouse/student3
Table default.student3 stats: [numFiles=1, numRows=3, totalSize=27, rawDataSize=24]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.2 sec   HDFS Read: 2905 HDFS Write: 99 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 200 msec
OK
id      name
Time taken: 24.143 seconds
hive (default)> select * from student3;
OK
student3.id     student3.name
1       zhangsan
2       lisi
3       wangwu
Time taken: 0.06 seconds, Fetched: 3 row(s)

Specifying the Data Path with Location at Table Creation

Create the table

create table if not exists student4(
    id int, name string
)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/student4';

Upload data to HDFS

dfs -put /opt/module/datas/student.txt /user/hive/warehouse/student4;

Query the data

hive (default)> select * from student4;
OK
student4.id     student4.name
1       zhangsan
2       lisi
3       wangwu
Time taken: 0.06 seconds, Fetched: 3 row(s)

Importing Data into a Specified Hive Table (Import)

Note: export the data with export first, then import it.

import table student2 partition(month='201909') from
'/user/hive/warehouse/export/student';
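
Taken together with the Export section below, the full round trip looks like this sketch (export writes both data and metadata; the target path must not already contain data):

```sql
-- Export first, then import the result into another table's partition
export table default.student to '/user/hive/warehouse/export/student';
import table student2 partition(month='201909') from
'/user/hive/warehouse/export/student';
```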

Data Export

Insert Export

Basic usage

insert overwrite local directory '/opt/module/datas/export/student' select * from student;
insert launches a MapReduce job
hive (default)> insert overwrite local directory '/opt/module/datas/export/student' select * from student;
Query ID = root_20190502141319_edb4e6fd-386d-499e-b74f-c13ba9caa89d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1554120237694_0007, Tracking URL = http://hadoop101:8088/proxy/application_1554120237694_0007/
Kill Command = /opt/module/hadoop-2.7.2/bin/hadoop job  -kill job_1554120237694_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-05-02 14:13:28,169 Stage-1 map = 0%,  reduce = 0%
2019-05-02 14:13:35,777 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.06 sec
MapReduce Total cumulative CPU time: 1 seconds 60 msec
Ended Job = job_1554120237694_0007
Copying data to local directory /opt/module/datas/export/student
Copying data to local directory /opt/module/datas/export/student
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.06 sec   HDFS Read: 2980 HDFS Write: 27 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 60 msec
OK
student.id      student.name
Time taken: 19.052 seconds
Inspect the result. With no row format specified, the export uses the non-printing default delimiter (\001), so cat shows the columns run together:
[root@hadoop101 student]# pwd
/opt/module/datas/export/student
[root@hadoop101 student]# cat 000000_0 
1zhangsan
2lisi
3wangwu

Export the query result to a local file with formatting

insert overwrite local directory '/opt/module/datas/export/student1'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'  select * from student;
Result
[root@hadoop101 export]# cat student1/000000_0 
1       zhangsan
2       lisi
3       wangwu

Export the query result to HDFS (without local)

insert overwrite directory '/user/hive/warehouse/export_stu'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
select * from student;
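
The exported file can then be inspected directly from the Hive shell (the output file name 000000_0 follows Hive's usual single-mapper convention and is an assumption here):

```sql
dfs -cat /user/hive/warehouse/export_stu/000000_0;
```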

Exporting to Local with Hadoop Commands

dfs -get /user/hive/warehouse/student2/month=201901/000000_0 
/opt/module/datas/export/stu_201901.txt;
[root@hadoop101 datas]# ls export/
stu_201901.txt  student  student1
[root@hadoop101 datas]# cat export/stu_201901.txt 
1       wangwu

Exporting via the Hive Shell

Basic syntax: hive -f/-e <statement or script> > file

hive -e 'select * from default.student;' > /opt/module/datas/export/student4.txt;
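
The -f variant does the same with a script file; a minimal sketch (the script path and output file name are assumed examples):

```shell
# Write the query to a script file, then run it and redirect the output
echo "select * from default.student;" > /opt/module/datas/hivef.sql
hive -f /opt/module/datas/hivef.sql > /opt/module/datas/export/student5.txt
```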

Exporting to HDFS with Export

export table default.student to '/user/hive/warehouse/export/student';

Sqoop Export

To be covered later.

Truncating Table Data (Truncate)

Note: truncate works only on managed (internal) tables; it cannot remove data from external tables.

truncate table student;
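
To illustrate the note above, truncating an external table fails; a hedged sketch (the table name is illustrative, and the exact error text varies by Hive version):

```sql
create external table student_ext(id string, name string)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/student_ext';
truncate table student_ext;
-- fails with a SemanticException along the lines of:
-- Cannot truncate non-managed table student_ext
```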
