Data Import

Loading Data into a Table (Load)

Syntax

load data [local] inpath '/opt/module/datas/student.txt'
[overwrite] into table student [partition (partcol1=val1, …)];
Parameter     Description
load data     Load data
local         Load from the local filesystem into the Hive table; without it, load from HDFS
inpath        Path of the data to load
overwrite     Overwrite the table's existing data; without it, data is appended
into table    The table to load into
student       The specific table name
partition     Load into the specified partition
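
Putting the clauses together, a hedged sketch of a load that overwrites a single partition (the table student_p and its month partition column are assumed for illustration; the student table used below is unpartitioned):

```sql
-- Hypothetical: overwrite one partition of a partitioned table
load data local inpath '/opt/module/datas/student.txt'
overwrite into table student_p partition (month='201905');
```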

Hands-on Examples

Create a table
create table student(id string, name string) row format delimited fields terminated by '\t';
Load a local file into Hive
load data local inpath '/opt/module/datas/student.txt' into table student;
hive (default)> load data local inpath '/opt/module/datas/student.txt' into table student;
Loading data to table default.student
Table default.student stats: [numFiles=1, totalSize=27]
OK
Time taken: 1.144 seconds
hive (default)> select * from student;
OK
student.id      student.name
1       zhangsan
2       lisi
3       wangwu
Time taken: 0.085 seconds, Fetched: 3 row(s)
Load an HDFS file into Hive

Upload the file to HDFS

dfs -put /opt/module/datas/student.txt /user/hive/warehouse;

Load the HDFS data

load data inpath '/user/hive/warehouse/student.txt' into table student;

Load data, overwriting the table's existing data

load data inpath '/user/hive/warehouse/student.txt' overwrite into table student;
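
Note that loading from an HDFS path moves the source file into the table's directory rather than copying it. This can be confirmed from the Hive shell (paths taken from the example above):

```sql
dfs -ls /user/hive/warehouse/;          -- student.txt is no longer here
dfs -ls /user/hive/warehouse/student/;  -- it has been moved under the table directory
```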

Hive DML Data Operations

Inserting Data into a Table via Query (Insert)

Create a partitioned table

create table student2(id int, name string) partitioned by (month string) row format delimited fields terminated by '\t';

Insert data

Basic insert
insert into table student2 partition(month='201905') values(1,'wangwu');

An insert into statement launches a MapReduce job. Note in the log below that the first attempt fails because student is not a partitioned table; the insert then succeeds against the partitioned student2.

hive (default)> insert into table  student partition(month='201905') values(1,'wangwu');
FAILED: SemanticException table is not partitioned but partition spec exists: {month=201905}
hive (default)> insert into table  student2 partition(month='201905') values(1,'wangwu');
Query ID = root_20190502135403_30a43b47-ec6a-4bbf-b407-50be007327d8
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1554120237694_0003, Tracking URL = http://hadoop101:8088/proxy/application_1554120237694_0003/
Kill Command = /opt/module/hadoop-2.7.2/bin/hadoop job  -kill job_1554120237694_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-05-02 13:54:30,454 Stage-1 map = 0%,  reduce = 0%
2019-05-02 13:54:39,718 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.4 sec
MapReduce Total cumulative CPU time: 2 seconds 400 msec
Ended Job = job_1554120237694_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop101:9000/user/hive/warehouse/student2/month=201905/.hive-staging_hive_2019-05-02_13-54-03_419_1815339772519319942-1/-ext-10000
Loading data to table default.student2 partition (month=201905)
Partition default.student2{month=201905} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 2.4 sec   HDFS Read: 3652 HDFS Write: 94 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 400 msec
OK
_col0   _col1
Time taken: 41.99 seconds
hive (default)> select * from student2;
OK
student2.id     student2.name   student2.month
1       wangwu  201905
Time taken: 0.973 seconds, Fetched: 1 row(s)
Basic insert from a query (single-table query result)
insert overwrite table student2 partition(month='201904') 
select id, name from student2 where month='201905';
hive (default)> insert overwrite table student2 partition(month='201904') select id, name from student2 where month='201905';
Query ID = root_20190502135725_8518579c-6415-4312-ac66-85d523b85a74
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1554120237694_0004, Tracking URL = http://hadoop101:8088/proxy/application_1554120237694_0004/
Kill Command = /opt/module/hadoop-2.7.2/bin/hadoop job  -kill job_1554120237694_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-05-02 13:57:37,006 Stage-1 map = 0%,  reduce = 0%
2019-05-02 13:57:46,786 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.57 sec
MapReduce Total cumulative CPU time: 1 seconds 570 msec
Ended Job = job_1554120237694_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop101:9000/user/hive/warehouse/student2/month=201904/.hive-staging_hive_2019-05-02_13-57-25_946_1497692349000822439-1/-ext-10000
Loading data to table default.student2 partition (month=201904)
Partition default.student2{month=201904} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.57 sec   HDFS Read: 3606 HDFS Write: 94 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 570 msec
OK
id      name
Time taken: 23.601 seconds
hive (default)> select * from student2;
OK
student2.id     student2.name   student2.month
1       wangwu  201904
1       wangwu  201905
Time taken: 0.162 seconds, Fetched: 2 row(s)
Multi-insert mode (one from clause feeding multiple inserts)
from student2 
insert overwrite table student2 partition(month='201901')
select id, name where month='201904'
insert overwrite table student2 partition(month='201902')
select id, name where month='201905';
hive (default)> from student2 
              > insert overwrite table student2 partition(month='201901')
              > select id, name where month='201904'
              > insert overwrite table student2 partition(month='201902')
              > select id, name where month='201905';
Query ID = root_20190502140047_d0fc48db-4e42-43c1-a7ea-3fcb22acd77d
Total jobs = 5
Launching Job 1 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1554120237694_0005, Tracking URL = http://hadoop101:8088/proxy/application_1554120237694_0005/
Kill Command = /opt/module/hadoop-2.7.2/bin/hadoop job  -kill job_1554120237694_0005
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 0
2019-05-02 14:00:56,920 Stage-2 map = 0%,  reduce = 0%
2019-05-02 14:01:05,735 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 1.97 sec
MapReduce Total cumulative CPU time: 1 seconds 970 msec
Ended Job = job_1554120237694_0005
Stage-5 is selected by condition resolver.
Stage-4 is filtered out by condition resolver.
Stage-6 is filtered out by condition resolver.
Stage-11 is selected by condition resolver.
Stage-10 is filtered out by condition resolver.
Stage-12 is filtered out by condition resolver.
Moving data to: hdfs://hadoop101:9000/user/hive/warehouse/student2/month=201901/.hive-staging_hive_2019-05-02_14-00-47_034_306334479514022067-1/-ext-10000
Moving data to: hdfs://hadoop101:9000/user/hive/warehouse/student2/month=201902/.hive-staging_hive_2019-05-02_14-00-47_034_306334479514022067-1/-ext-10002
Loading data to table default.student2 partition (month=201901)
Loading data to table default.student2 partition (month=201902)
Partition default.student2{month=201901} stats: [numFiles=1, numRows=0, totalSize=9, rawDataSize=0]
Partition default.student2{month=201902} stats: [numFiles=1, numRows=0, totalSize=9, rawDataSize=0]
MapReduce Jobs Launched: 
Stage-Stage-2: Map: 1   Cumulative CPU: 1.97 sec   HDFS Read: 5460 HDFS Write: 188 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 970 msec
OK
id      name
Time taken: 21.918 seconds
hive (default)> select * from student2;
OK
student2.id     student2.name   student2.month
1       wangwu  201901
1       wangwu  201902
1       wangwu  201904
1       wangwu  201905
Time taken: 0.31 seconds, Fetched: 4 row(s)
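
Beyond the static partitions used above, Hive also supports dynamic partition inserts, where each row's target partition comes from the query itself. A minimal sketch (the two set statements are required; nonstrict mode allows all partition columns to be dynamic):

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- The last select column (month) determines each row's target partition
insert overwrite table student2 partition(month)
select id, name, month from student2;
```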

Creating a Table from a Query and Loading Data (As Select)

Create a table from a query result; the rows returned by the query populate the new table.

create table if not exists student3 as select id, name from student;
hive (default)> create table if not exists student3
              > as select id, name from student;
Query ID = root_20190502140249_4f88537d-a1e3-4181-b659-42ba313491c8
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1554120237694_0006, Tracking URL = http://hadoop101:8088/proxy/application_1554120237694_0006/
Kill Command = /opt/module/hadoop-2.7.2/bin/hadoop job  -kill job_1554120237694_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-05-02 14:03:01,836 Stage-1 map = 0%,  reduce = 0%
2019-05-02 14:03:10,732 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.2 sec
MapReduce Total cumulative CPU time: 1 seconds 200 msec
Ended Job = job_1554120237694_0006
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop101:9000/user/hive/warehouse/.hive-staging_hive_2019-05-02_14-02-49_692_5998422552783035794-1/-ext-10001
Moving data to: hdfs://hadoop101:9000/user/hive/warehouse/student3
Table default.student3 stats: [numFiles=1, numRows=3, totalSize=27, rawDataSize=24]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.2 sec   HDFS Read: 2905 HDFS Write: 99 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 200 msec
OK
id      name
Time taken: 24.143 seconds
hive (default)> select * from student3;
OK
student3.id     student3.name
1       zhangsan
2       lisi
3       wangwu
Time taken: 0.06 seconds, Fetched: 3 row(s)

Specifying the Data Path with Location at Table Creation

Create the table

create table if not exists student4(
    id int, name string
)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/student4';

Upload data to HDFS

dfs -put /opt/module/datas/student.txt /user/hive/warehouse/student4;

Query the data

hive (default)> select * from student4;
OK
student4.id     student4.name
1       zhangsan
2       lisi
3       wangwu
Time taken: 0.06 seconds, Fetched: 3 row(s)

Importing Data into a Specified Hive Table (Import)

Note: export the data with export first, then import it.

import table student2 partition(month='201909') from
'/user/hive/warehouse/export/student';
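
Taken together with the Export section below, the full round trip looks like this sketch (export writes both data and metadata; the target path must not already contain data):

```sql
-- Export first, then import the result into another table's partition
export table default.student to '/user/hive/warehouse/export/student';
import table student2 partition(month='201909') from
'/user/hive/warehouse/export/student';
```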

Data Export

Insert Export

Basic usage

insert overwrite local directory '/opt/module/datas/export/student' select * from student;
insert launches a MapReduce job
hive (default)> insert overwrite local directory '/opt/module/datas/export/student' select * from student;
Query ID = root_20190502141319_edb4e6fd-386d-499e-b74f-c13ba9caa89d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1554120237694_0007, Tracking URL = http://hadoop101:8088/proxy/application_1554120237694_0007/
Kill Command = /opt/module/hadoop-2.7.2/bin/hadoop job  -kill job_1554120237694_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-05-02 14:13:28,169 Stage-1 map = 0%,  reduce = 0%
2019-05-02 14:13:35,777 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.06 sec
MapReduce Total cumulative CPU time: 1 seconds 60 msec
Ended Job = job_1554120237694_0007
Copying data to local directory /opt/module/datas/export/student
Copying data to local directory /opt/module/datas/export/student
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.06 sec   HDFS Read: 2980 HDFS Write: 27 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 60 msec
OK
student.id      student.name
Time taken: 19.052 seconds
Inspect the result. With no row format specified, the export uses the non-printing default delimiter (\001), so cat shows the columns run together:
[root@hadoop101 student]# pwd
/opt/module/datas/export/student
[root@hadoop101 student]# cat 000000_0 
1zhangsan
2lisi
3wangwu

Export the query result to a local file with formatting

insert overwrite local directory '/opt/module/datas/export/student1'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'  select * from student;
Result
[root@hadoop101 export]# cat student1/000000_0 
1       zhangsan
2       lisi
3       wangwu

Export the query result to HDFS (without local)

insert overwrite directory '/user/hive/warehouse/export_stu'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
select * from student;
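
The exported file can then be inspected directly from the Hive shell (the output file name 000000_0 follows Hive's usual single-mapper convention and is an assumption here):

```sql
dfs -cat /user/hive/warehouse/export_stu/000000_0;
```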

Exporting to Local with Hadoop Commands

dfs -get /user/hive/warehouse/student2/month=201901/000000_0 
/opt/module/datas/export/stu_201901.txt;
[root@hadoop101 datas]# ls export/
stu_201901.txt  student  student1
[root@hadoop101 datas]# cat export/stu_201901.txt 
1       wangwu

Exporting via the Hive Shell

Basic syntax: hive -f/-e <statement or script> > file

hive -e 'select * from default.student;' > /opt/module/datas/export/student4.txt;
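
The -f variant does the same with a script file; a minimal sketch (the script path and output file name are assumed examples):

```shell
# Write the query to a script file, then run it and redirect the output
echo "select * from default.student;" > /opt/module/datas/hivef.sql
hive -f /opt/module/datas/hivef.sql > /opt/module/datas/export/student5.txt
```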

Exporting to HDFS with Export

export table default.student to '/user/hive/warehouse/export/student';

Sqoop Export

To be covered later.

Truncating Table Data (Truncate)

Note: truncate works only on managed (internal) tables; it cannot remove data from external tables.

truncate table student;
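
To illustrate the note above, truncating an external table fails; a hedged sketch (the table name is illustrative, and the exact error text varies by Hive version):

```sql
create external table student_ext(id string, name string)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/student_ext';
truncate table student_ext;
-- fails with a SemanticException along the lines of:
-- Cannot truncate non-managed table student_ext
```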
