I. Preparing the Data Source
First, download the sample data from http://grouplens.org/ (the u.data file used below comes from the MovieLens 100K ratings dataset), then load it into Hive for the experiments that follow.
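For reference, a minimal download sketch; the exact archive URL and layout are assumptions, so check the GroupLens site for the current links:

```bash
#!/bin/bash
# Fetch the MovieLens 100K dataset and stage u.data for Hive.
# NOTE: the archive URL and layout are assumptions; verify on grouplens.org.
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
# u.data columns: userid <TAB> movieid <TAB> rating <TAB> unixtime
cp ml-100k/u.data /home/hadoop/u.data
```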
II. Managed (Internal) Tables
Create a managed table and load data into it.
```
hadoop@hadoopmaster:~$ beeline -u jdbc:hive2://hadoopmaster:10000/
Beeline version 2.1.0 by Apache Hive
0: jdbc:hive2://hadoopmaster:10000/> show databases;
OK
+----------------+--+
| database_name  |
+----------------+--+
| default        |
| fincials       |
+----------------+--+
2 rows selected (1.038 seconds)
0: jdbc:hive2://hadoopmaster:10000/> use default;
OK
No rows affected (0.034 seconds)
0: jdbc:hive2://hadoopmaster:10000/> create table u_data (userid INT, movieid INT, rating INT, unixtime STRING) row format delimited fields terminated by '\t' lines terminated by '\n';
OK
No rows affected (0.242 seconds)
0: jdbc:hive2://hadoopmaster:10000/> LOAD DATA LOCAL INPATH '/home/hadoop/u.data' OVERWRITE INTO TABLE u_data;
Loading data to table default.u_data
OK
No rows affected (0.351 seconds)
0: jdbc:hive2://hadoopmaster:10000/> select * from u_data;
OK
+----------------+-----------------+----------------+------------------+--+
| u_data.userid  | u_data.movieid  | u_data.rating  | u_data.unixtime  |
+----------------+-----------------+----------------+------------------+--+
| 196            | 242             | 3              | 881250949        |
| 186            | 302             | 3              | 891717742        |
| 22             | 377             | 1              | 878887116        |
| 244            | 51              | 2              | 880606923        |
| 166            | 346             | 1              | 886397596        |
| 298            | 474             | 4              | 884182806        |
| 115            | 265             | 2              | 881171488        |
| 253            | 465             | 5              | 891628467        |
| 305            | 451             | 3              | 886324817        |
| 6              | 86              | 3              | 883603013        |
| 62             | 257             | 2              | 879372434        |
| 286            | 1014            | 5              | 879781125        |
```
Check the HDFS space the table occupies:
```
hadoop@hadoopmaster:~$ hdfs dfs -ls /user/hive/warehouse/u_data
Found 1 items
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:19 /user/hive/warehouse/u_data/u.data
```
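To get just the total size, `hdfs dfs -du -h` prints a human-readable summary:

```bash
# Human-readable disk usage of the table's warehouse directory
hdfs dfs -du -h /user/hive/warehouse/u_data
```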
Load the data repeatedly with a script
First, check how many rows the table already holds:
```
0: jdbc:hive2://hadoopmaster:10000/> select count(*) from u_data;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20160722102853_77aa1bc6-79c2-4916-9b07-a763d112ef41
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1468978056881_0003, Tracking URL = http://hadoopmaster:8088/proxy/application_1468978056881_0003/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1468978056881_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-07-22 10:28:58,786 Stage-1 map = 0%, reduce = 0%
2016-07-22 10:29:03,890 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.89 sec
2016-07-22 10:29:10,005 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.71 sec
MapReduce Total cumulative CPU time: 1 seconds 710 msec
Ended Job = job_1468978056881_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 1.71 sec   HDFS Read: 1987050 HDFS Write: 106 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 710 msec
OK
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
+---------+--+
|   c0    |
+---------+--+
| 100000  |
+---------+--+
1 row selected (17.757 seconds)
```
With the MapReduce execution engine, Hive is genuinely slow: counting 100,000 rows took about 17 seconds, far behind a relational database. For interactive queries you really want an engine like Spark or Tez.
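For what it's worth, the engine can be switched per session with a single property, assuming Tez or Spark is actually deployed on the cluster:

```sql
-- Valid values are mr, tez, and spark; the chosen engine must be installed.
set hive.execution.engine=tez;
select count(*) from u_data;
```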
Next, it helps to know how to run a one-off statement from the shell; `hive -e` does exactly that:
```
hadoop@hadoopmaster:~$ hive -e "LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data;"
Loading data to table default.u_data
OK
Time taken: 1.239 seconds
```
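Relatedly, `hive -f` runs a whole file of statements; the file name below is hypothetical:

```bash
# Execute every statement in the given script file
hive -f /home/hadoop/load_u_data.sql
```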
Finally, the load script:
```bash
#!/bin/bash
# Load the same local file into u_data ten times.
# hive -e runs in the foreground, so each load finishes
# before the next iteration starts.
for ((c = 1; c <= 10; c++))
do
    echo "Loading pass $c ..."
    hive -e "LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data;"
done
```
After the loads finish, check the query cost again:
```
0: jdbc:hive2://hadoopmaster:10000/> select count(*) from u_data;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20160722104633_18c3467d-9263-4785-8714-1570fc3bb9ae
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1468978056881_0009, Tracking URL = http://hadoopmaster:8088/proxy/application_1468978056881_0009/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1468978056881_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-07-22 10:46:39,037 Stage-1 map = 0%, reduce = 0%
2016-07-22 10:46:46,190 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.82 sec
2016-07-22 10:46:52,310 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.67 sec
MapReduce Total cumulative CPU time: 2 seconds 670 msec
Ended Job = job_1468978056881_0009
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.67 sec   HDFS Read: 77198770 HDFS Write: 107 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 670 msec
OK
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
+----------+--+
|    c0    |
+----------+--+
| 3900000  |
+----------+--+
1 row selected (20.173 seconds)
```
About 20 seconds. The table is now 39 times larger, yet the elapsed time barely moved from the earlier 17 seconds, which suggests MapReduce job startup dominates the cost rather than the scan itself. Also note how each LOAD copied the file into the table directory under a fresh u_copy_N.data name:
```
hadoop@hadoopmaster:~$ hdfs dfs -ls /user/hive/warehouse/u_data
Found 39 items
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:37 /user/hive/warehouse/u_data/u.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:38 /user/hive/warehouse/u_data/u_copy_1.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:40 /user/hive/warehouse/u_data/u_copy_10.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:40 /user/hive/warehouse/u_data/u_copy_11.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:41 /user/hive/warehouse/u_data/u_copy_12.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:42 /user/hive/warehouse/u_data/u_copy_13.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:42 /user/hive/warehouse/u_data/u_copy_14.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:42 /user/hive/warehouse/u_data/u_copy_15.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:42 /user/hive/warehouse/u_data/u_copy_16.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_17.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_18.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_19.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_2.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_20.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_21.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_22.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_23.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_24.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_25.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_26.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_27.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_28.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_29.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_3.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_30.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_31.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_32.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_33.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_34.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_35.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_36.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:46 /user/hive/warehouse/u_data/u_copy_37.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:46 /user/hive/warehouse/u_data/u_copy_38.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_4.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_5.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_6.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_7.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_8.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:40 /user/hive/warehouse/u_data/u_copy_9.data
```
III. External Tables
1. Create an external table
```
0: jdbc:hive2://hadoopmaster:10000/> create external table u_data_external_table (userid INT, movieid INT, rating INT, unixtime STRING) row format delimited fields terminated by '\t' lines terminated by '\n';
OK
No rows affected (0.047 seconds)
0: jdbc:hive2://hadoopmaster:10000/> show tables;
OK
+------------------------+--+
|        tab_name        |
+------------------------+--+
| employees              |
| t_hive                 |
| t_hive2                |
| u_data                 |
| u_data_external_table  |
+------------------------+--+
5 rows selected (0.036 seconds)
```
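Note that the statement above omits a LOCATION clause, so Hive still places this external table under the default warehouse path. More typically an external table points at an existing HDFS directory; a minimal sketch, with a hypothetical path:

```sql
-- External table over a pre-existing HDFS directory (path is an example)
create external table u_data_ext (userid INT, movieid INT, rating INT, unixtime STRING)
row format delimited fields terminated by '\t' lines terminated by '\n'
location '/data/movielens/u_data';
```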
2. Load data
```
hive -e "LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data_external_table;"
```
3. Managed vs. external tables
After dropping both of the tables created above with `drop table` (one managed, one external), the warehouse looks like this:

```
hadoop@hadoopmaster:~$ hdfs dfs -ls /user/hive/warehouse/
Found 5 items
drwxrwxr-x   - hadoop supergroup          0 2016-07-20 17:25 /user/hive/warehouse/employees
drwxrwxr-x   - hadoop supergroup          0 2016-07-21 15:52 /user/hive/warehouse/fincials.db
drwxrwxr-x   - hadoop supergroup          0 2016-07-20 09:50 /user/hive/warehouse/t_hive
drwxrwxr-x   - hadoop supergroup          0 2016-07-20 09:54 /user/hive/warehouse/t_hive2
drwxrwxr-x   - hadoop supergroup          0 2016-07-22 11:04 /user/hive/warehouse/u_data_external_table
```
The managed table's data is gone entirely, while the external table's data is still there.
To wrap up, the differences between managed tables and external tables in Hive:
- When data is loaded into an external table that points at its own LOCATION, the data is not moved into Hive's warehouse directory; an external table does not manage its data, whereas a managed table does.
- When a managed table is dropped, Hive deletes both its metadata and its data; when an external table is dropped, Hive deletes only the metadata and leaves the data untouched. Which should you use? In most cases there is little practical difference, so it comes down to preference, but as a rule of thumb: if all processing will be done by Hive, create a managed table; otherwise use an external table.
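You can confirm which kind a given table is from Hive itself: `describe formatted` reports it in the Table Type field.

```sql
-- Output includes "Table Type: MANAGED_TABLE" or "Table Type: EXTERNAL_TABLE"
describe formatted u_data_external_table;
```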
IV. Partitioned Tables
```
0: jdbc:hive2://hadoopmaster:10000/> create table u_data_partitioned_table (userid INT, movieid INT, rating INT, unixtime STRING) partitioned by (day int) row format delimited fields terminated by '\t' lines terminated by '\n';
OK
No rows affected (0.256 seconds)
0: jdbc:hive2://hadoopmaster:10000/> LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data_partitioned_table partition(day=20160101);
Loading data to table default.u_data_partitioned_table partition (day=20160101)
OK
No rows affected (0.424 seconds)
0: jdbc:hive2://hadoopmaster:10000/> LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data_partitioned_table partition(day=20160101);
Loading data to table default.u_data_partitioned_table partition (day=20160101)
OK
No rows affected (0.424 seconds)
0: jdbc:hive2://hadoopmaster:10000/> LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data_partitioned_table partition(day=20160102);
Loading data to table default.u_data_partitioned_table partition (day=20160102)
OK
No rows affected (0.499 seconds)
```

Each partition appears as its own subdirectory under the table directory (note that day=20160101 was loaded twice, which matters for the row count in the next section):

```
hadoop@hadoopmaster:~$ hdfs dfs -ls /user/hive/warehouse/u_data_partitioned_table
Found 2 items
drwxrwxr-x   - hadoop supergroup          0 2016-07-22 13:51 /user/hive/warehouse/u_data_partitioned_table/day=20160101
drwxrwxr-x   - hadoop supergroup          0 2016-07-22 13:51 /user/hive/warehouse/u_data_partitioned_table/day=20160102
```
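The payoff is partition pruning: a query that filters on the partition column reads only the matching subdirectory. A minimal example against the table above:

```sql
-- Only the files under day=20160101 are scanned
select count(*) from u_data_partitioned_table where day = 20160101;
```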
V. Bucketed Tables
Create a table clustered by userid into 4 buckets and fill it from the partitioned table (two loads of day=20160101 plus one of day=20160102 give 3 × 100,000 = 300,000 rows); the insert runs with 4 reducers, one per bucket:
```
0: jdbc:hive2://hadoopmaster:10000/> CREATE TABLE bucketed_data_user (userid INT, movieid INT, rating INT, unixtime STRING) CLUSTERED BY (userid) INTO 4 BUCKETS row format delimited fields terminated by '\t' lines terminated by '\n';
OK
No rows affected (0.045 seconds)
0: jdbc:hive2://hadoopmaster:10000/> insert overwrite table bucketed_data_user select userid, movieid, rating, unixtime from u_data_partitioned_table;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20160722140142_c272bc07-b74d-4b5b-9689-0bec2ce71780
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1468978056881_0010, Tracking URL = http://hadoopmaster:8088/proxy/application_1468978056881_0010/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1468978056881_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2016-07-22 14:01:48,774 Stage-1 map = 0%, reduce = 0%
2016-07-22 14:01:55,978 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.89 sec
2016-07-22 14:02:06,236 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 5.66 sec
2016-07-22 14:02:07,272 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.43 sec
MapReduce Total cumulative CPU time: 9 seconds 430 msec
Ended Job = job_1468978056881_0010
Loading data to table default.bucketed_data_user
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 4   Cumulative CPU: 9.43 sec   HDFS Read: 5959693 HDFS Write: 5937879 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 430 msec
OK
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
No rows affected (26.251 seconds)
0: jdbc:hive2://hadoopmaster:10000/> select count(*) from bucketed_data_user;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20160722141056_eaf582be-4107-403a-bacd-0a18f567f576
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1468978056881_0012, Tracking URL = http://hadoopmaster:8088/proxy/application_1468978056881_0012/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1468978056881_0012
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-07-22 14:11:04,156 Stage-1 map = 0%, reduce = 0%
2016-07-22 14:11:09,331 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.94 sec
2016-07-22 14:11:15,488 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.78 sec
MapReduce Total cumulative CPU time: 1 seconds 780 msec
Ended Job = job_1468978056881_0012
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 1.78 sec   HDFS Read: 5945855 HDFS Write: 106 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 780 msec
OK
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
+---------+--+
|   c0    |
+---------+--+
| 300000  |
+---------+--+
1 row selected (20.397 seconds)
```
Each of the 4 buckets materializes as one file in the table's directory:

```
hadoop@hadoopmaster:~$ hdfs dfs -ls /user/hive/warehouse/bucketed_data_user
Found 4 items
-rwxrwxr-x   2 hadoop supergroup    1400994 2016-07-22 14:02 /user/hive/warehouse/bucketed_data_user/000000_0
-rwxrwxr-x   2 hadoop supergroup    1493856 2016-07-22 14:02 /user/hive/warehouse/bucketed_data_user/000001_0
-rwxrwxr-x   2 hadoop supergroup    1566738 2016-07-22 14:02 /user/hive/warehouse/bucketed_data_user/000002_0
-rwxrwxr-x   2 hadoop supergroup    1475931 2016-07-22 14:02 /user/hive/warehouse/bucketed_data_user/000003_0
```
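One thing bucketing buys is cheap sampling: TABLESAMPLE can read a single bucket instead of the whole table. A minimal sketch against the table above:

```sql
-- Reads roughly a quarter of the rows by scanning bucket 1 of 4 on userid
select count(*) from bucketed_data_user tablesample(bucket 1 out of 4 on userid);
```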