Hive Hands-On Case Study

1. Preparing the Data Source

First we download the MovieLens dataset from http://grouplens.org/, then load it into Hive for the experiments below.
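For reference, the u.data file in the MovieLens 100K dataset is plain text: one rating per line, four tab-separated fields (userid, movieid, rating, unixtime). A minimal Python sketch of that layout (the two sample lines are copied from the select * output shown further down):

```python
# u.data layout: userid \t movieid \t rating \t unixtime, one rating per line.
# The two sample lines below are copied from the select * output later on.
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

rows = []
for line in sample.splitlines():
    userid, movieid, rating, unixtime = line.split("\t")
    rows.append((int(userid), int(movieid), int(rating), unixtime))

print(rows)  # [(196, 242, 3, '881250949'), (186, 302, 3, '891717742')]
```

This matches the table schema created below: three INT columns and a STRING for the timestamp.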

2. Internal Tables

Create an internal table and load the data:

[email protected]:~$ beeline -u jdbc:hive2://hadoopmaster:10000/
Beeline version 2.1.0 by Apache Hive
0: jdbc:hive2://hadoopmaster:10000/> show databases;
OK
+----------------+--+
| database_name  |
+----------------+--+
| default        |
| fincials       |
+----------------+--+
2 rows selected (1.038 seconds)
0: jdbc:hive2://hadoopmaster:10000/> use default;
OK
No rows affected (0.034 seconds)
0: jdbc:hive2://hadoopmaster:10000/> create table u_data (userid INT, movieid INT, rating INT, unixtime STRING) row format delimited fields terminated by '\t' lines terminated by '\n';
OK
No rows affected (0.242 seconds)
0: jdbc:hive2://hadoopmaster:10000/> LOAD DATA LOCAL INPATH '/home/hadoop/u.data' OVERWRITE INTO TABLE u_data;
Loading data to table default.u_data
OK
No rows affected (0.351 seconds)
0: jdbc:hive2://hadoopmaster:10000/> select * from u_data;
OK
+----------------+-----------------+----------------+------------------+--+
| u_data.userid  | u_data.movieid  | u_data.rating  | u_data.unixtime  |
+----------------+-----------------+----------------+------------------+--+
| 196            | 242             | 3              | 881250949        |
| 186            | 302             | 3              | 891717742        |
| 22             | 377             | 1              | 878887116        |
| 244            | 51              | 2              | 880606923        |
| 166            | 346             | 1              | 886397596        |
| 298            | 474             | 4              | 884182806        |
| 115            | 265             | 2              | 881171488        |
| 253            | 465             | 5              | 891628467        |
| 305            | 451             | 3              | 886324817        |
| 6              | 86              | 3              | 883603013        |
| 62             | 257             | 2              | 879372434        |
| 286            | 1014            | 5              | 879781125        |

Check the HDFS space used:

[email protected]:~$ hdfs dfs -ls /user/hive/warehouse/u_data
Found 1 items
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:19 /user/hive/warehouse/u_data/u.data
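As a quick sanity check (my own arithmetic, not from the original transcript): 1,979,173 bytes for 100,000 ratings works out to roughly 20 bytes per row, which is plausible for four short tab-separated fields plus a newline:

```python
# Sanity check on the listing above: average bytes per rating row.
size_bytes = 1_979_173   # size of u.data reported by hdfs dfs -ls
num_rows = 100_000       # ratings in the MovieLens 100K u.data file

bytes_per_row = round(size_bytes / num_rows, 1)
print(bytes_per_row)  # -> 19.8
```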

Write a script to load the data repeatedly

First check how many rows there are now:

0: jdbc:hive2://hadoopmaster:10000/> select count(*) from u_data;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20160722102853_77aa1bc6-79c2-4916-9b07-a763d112ef41
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1468978056881_0003, Tracking URL = http://hadoopmaster:8088/proxy/application_1468978056881_0003/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1468978056881_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-07-22 10:28:58,786 Stage-1 map = 0%, reduce = 0%
2016-07-22 10:29:03,890 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.89 sec
2016-07-22 10:29:10,005 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.71 sec
MapReduce Total cumulative CPU time: 1 seconds 710 msec
Ended Job = job_1468978056881_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 1.71 sec  HDFS Read: 1987050  HDFS Write: 106  SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 710 msec
OK
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
+---------+--+
|   c0    |
+---------+--+
| 100000  |
+---------+--+
1 row selected (17.757 seconds)

With the MapReduce engine, Hive is honestly not fast: counting 100,000 rows took 17 seconds, well behind a relational database. We really do need Spark for this.

Next we need a way to run a single Hive statement non-interactively; hive -e does exactly that:

[email protected]:~$ hive -e "LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data;"
Loading data to table default.u_data
OK
Time taken: 1.239 seconds

Finally, the script:

#!/bin/bash
for ((c = 1; c <= 10; c++))
do
    echo "Loading batch $c ..."
    # hive -e runs in the foreground, so each load finishes before the next starts
    hive -e "LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data;"
done
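Because LOAD DATA without OVERWRITE appends a fresh copy of the file on each run, the row count grows linearly with the number of loads. A small sketch of that arithmetic (the figure of 39 copies matches the warehouse listing shown further down):

```python
ROWS_PER_FILE = 100_000  # ratings in one copy of u.data

def total_rows(num_loads: int) -> int:
    # Each LOAD DATA ... INTO TABLE (without OVERWRITE) adds one more copy.
    return num_loads * ROWS_PER_FILE

print(total_rows(39))  # -> 3900000
```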

After the inserts are done, check the query cost again:

0: jdbc:hive2://hadoopmaster:10000/> select count(*) from u_data;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20160722104633_18c3467d-9263-4785-8714-1570fc3bb9ae
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1468978056881_0009, Tracking URL = http://hadoopmaster:8088/proxy/application_1468978056881_0009/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1468978056881_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-07-22 10:46:39,037 Stage-1 map = 0%, reduce = 0%
2016-07-22 10:46:46,190 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.82 sec
2016-07-22 10:46:52,310 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.67 sec
MapReduce Total cumulative CPU time: 2 seconds 670 msec
Ended Job = job_1468978056881_0009
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 2.67 sec  HDFS Read: 77198770  HDFS Write: 107  SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 670 msec
OK
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
+----------+--+
|    c0    |
+----------+--+
| 3900000  |
+----------+--+
1 row selected (20.173 seconds)

That took 20 seconds. The startup cost of a MapReduce job really does look high.
[email protected]:~$ hdfs dfs -ls /user/hive/warehouse/u_data
Found 39 items
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:37 /user/hive/warehouse/u_data/u.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:38 /user/hive/warehouse/u_data/u_copy_1.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:40 /user/hive/warehouse/u_data/u_copy_10.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:40 /user/hive/warehouse/u_data/u_copy_11.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:41 /user/hive/warehouse/u_data/u_copy_12.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:42 /user/hive/warehouse/u_data/u_copy_13.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:42 /user/hive/warehouse/u_data/u_copy_14.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:42 /user/hive/warehouse/u_data/u_copy_15.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:42 /user/hive/warehouse/u_data/u_copy_16.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_17.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_18.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_19.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_2.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_20.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_21.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_22.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_23.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_24.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_25.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_26.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_27.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_28.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_29.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_3.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_30.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_31.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_32.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_33.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_34.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_35.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_36.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:46 /user/hive/warehouse/u_data/u_copy_37.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:46 /user/hive/warehouse/u_data/u_copy_38.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_4.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_5.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_6.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_7.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_8.data
-rwxrwxr-x   2 hadoop supergroup    1979173 2016-07-22 10:40 /user/hive/warehouse/u_data/u_copy_9.data

3. External Tables

1 Create an external table and load data

0: jdbc:hive2://hadoopmaster:10000/> create external table u_data_external_table (userid INT, movieid INT, rating INT, unixtime STRING) row format delimited fields terminated by '\t' lines terminated by '\n';
OK
No rows affected (0.047 seconds)

0: jdbc:hive2://hadoopmaster:10000/> show tables;
OK
+------------------------+--+
|        tab_name        |
+------------------------+--+
| employees              |
| t_hive                 |
| t_hive2                |
| u_data                 |
| u_data_external_table  |
+------------------------+--+
5 rows selected (0.036 seconds)

2 Load the data

hive -e "LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data_external_table;"

3 Differences between internal and external tables

I dropped both tables just created with drop table, one internal and one external. The result:

[email protected]:~$ hdfs dfs -ls /user/hive/warehouse/
Found 5 items
drwxrwxr-x   - hadoop supergroup          0 2016-07-20 17:25 /user/hive/warehouse/employees
drwxrwxr-x   - hadoop supergroup          0 2016-07-21 15:52 /user/hive/warehouse/fincials.db
drwxrwxr-x   - hadoop supergroup          0 2016-07-20 09:50 /user/hive/warehouse/t_hive
drwxrwxr-x   - hadoop supergroup          0 2016-07-20 09:54 /user/hive/warehouse/t_hive2
drwxrwxr-x   - hadoop supergroup          0 2016-07-22 11:04 /user/hive/warehouse/u_data_external_table

The internal table's data is gone completely, while the external table's data is still there.

Finally, to summarize the differences between managed (internal) tables and external tables in Hive:

  • When data is loaded into an external table, it is not moved under Hive's own warehouse directory; in other words, the data in an external table is not managed by Hive itself. A managed table is different;

  • When a managed table is dropped, Hive deletes both its metadata and its data; when an external table is dropped, Hive deletes only the metadata, never the data. So which kind should you use? In most cases there is little practical difference, and the choice is largely personal preference. As a rule of thumb, though: if all processing will be done by Hive, create a managed table; otherwise, use an external table.
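The drop behaviour can be sketched as a toy model (hypothetical Python, not Hive code; the table names are the ones used in this article): the metastore entry is always removed, but the warehouse directory is removed only for managed tables.

```python
# Toy model of DROP TABLE semantics: managed tables own their data,
# external tables do not.
metastore = {"u_data": "managed", "u_data_external_table": "external"}
storage = {"/user/hive/warehouse/u_data",
           "/user/hive/warehouse/u_data_external_table"}

def drop_table(name: str) -> None:
    kind = metastore.pop(name)        # the metadata entry is always removed
    if kind == "managed":             # data is removed only for managed tables
        storage.discard(f"/user/hive/warehouse/{name}")

drop_table("u_data")
drop_table("u_data_external_table")
print(storage)  # only the external table's directory remains
```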

4. Partitioned Tables

0: jdbc:hive2://hadoopmaster:10000/> create table u_data_partitioned_table (userid INT, movieid INT, rating INT, unixtime STRING) partitioned by (day int) row format delimited fields terminated by '\t' lines terminated by '\n';
OK
No rows affected (0.256 seconds)
0: jdbc:hive2://hadoopmaster:10000/>
0: jdbc:hive2://hadoopmaster:10000/> LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data_partitioned_table partition(day=20160101);
Loading data to table default.u_data_partitioned_table partition (day=20160101)
OK
No rows affected (0.424 seconds)
0: jdbc:hive2://hadoopmaster:10000/>
100,000 rows selected (4.653 seconds)
0: jdbc:hive2://hadoopmaster:10000/> LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data_partitioned_table partition(day=20160101);
Loading data to table default.u_data_partitioned_table partition (day=20160101)
OK
No rows affected (0.424 seconds)
0: jdbc:hive2://hadoopmaster:10000/> LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data_partitioned_table partition(day=20160102);
Loading data to table default.u_data_partitioned_table partition (day=20160102)
OK
No rows affected (0.499 seconds)
0: jdbc:hive2://hadoopmaster:10000/>

[email protected]:~$ hdfs dfs -ls /user/hive/warehouse/u_data_partitioned_table
Found 2 items
drwxrwxr-x   - hadoop supergroup          0 2016-07-22 13:51 /user/hive/warehouse/u_data_partitioned_table/day=20160101
drwxrwxr-x   - hadoop supergroup          0 2016-07-22 13:51 /user/hive/warehouse/u_data_partitioned_table/day=20160102
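As the listing above shows, each partition is just a subdirectory named day=<value>. A hypothetical sketch of why that matters: a predicate on the partition column lets Hive skip whole directories instead of scanning every file (the file names inside the directories are illustrative).

```python
# Toy model of the partition layout seen in the listing above.
# The per-partition file paths are illustrative, not taken from the cluster.
partitions = {
    "day=20160101": ["/user/hive/warehouse/u_data_partitioned_table/day=20160101/u.data"],
    "day=20160102": ["/user/hive/warehouse/u_data_partitioned_table/day=20160102/u.data"],
}

def files_for(day: int) -> list:
    # A WHERE day = ... predicate means Hive only reads the matching directory.
    return partitions.get(f"day={day}", [])

print(files_for(20160102))
```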

5. Bucketed Tables

0: jdbc:hive2://hadoopmaster:10000/> CREATE TABLE bucketed_data_user (userid INT, movieid INT, rating INT, unixtime STRING) CLUSTERED BY (userid) INTO 4 BUCKETS row format delimited fields terminated by '\t' lines terminated by '\n';
OK
No rows affected (0.045 seconds)
0: jdbc:hive2://hadoopmaster:10000/>
0: jdbc:hive2://hadoopmaster:10000/> insert overwrite table bucketed_data_user select userid, movieid, rating, unixtime from u_data_partitioned_table;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20160722140142_c272bc07-b74d-4b5b-9689-0bec2ce71780
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1468978056881_0010, Tracking URL = http://hadoopmaster:8088/proxy/application_1468978056881_0010/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1468978056881_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2016-07-22 14:01:48,774 Stage-1 map = 0%, reduce = 0%
2016-07-22 14:01:55,978 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.89 sec
2016-07-22 14:02:06,236 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 5.66 sec
2016-07-22 14:02:07,272 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.43 sec
MapReduce Total cumulative CPU time: 9 seconds 430 msec
Ended Job = job_1468978056881_0010
Loading data to table default.bucketed_data_user
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 4  Cumulative CPU: 9.43 sec  HDFS Read: 5959693  HDFS Write: 5937879  SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 430 msec
OK
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
No rows affected (26.251 seconds)
0: jdbc:hive2://hadoopmaster:10000/>
0: jdbc:hive2://hadoopmaster:10000/> select count(*) from bucketed_data_user;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20160722141056_eaf582be-4107-403a-bacd-0a18f567f576
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1468978056881_0012, Tracking URL = http://hadoopmaster:8088/proxy/application_1468978056881_0012/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1468978056881_0012
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-07-22 14:11:04,156 Stage-1 map = 0%, reduce = 0%
2016-07-22 14:11:09,331 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.94 sec
2016-07-22 14:11:15,488 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.78 sec
MapReduce Total cumulative CPU time: 1 seconds 780 msec
Ended Job = job_1468978056881_0012
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 1.78 sec  HDFS Read: 5945855  HDFS Write: 106  SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 780 msec
OK
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
+---------+--+
|   c0    |
+---------+--+
| 300000  |
+---------+--+
1 row selected (20.397 seconds)
0: jdbc:hive2://hadoopmaster:10000/>
[email protected]:~$ hdfs dfs -ls /user/hive/warehouse/bucketed_data_user
Found 4 items
-rwxrwxr-x   2 hadoop supergroup    1400994 2016-07-22 14:02 /user/hive/warehouse/bucketed_data_user/000000_0
-rwxrwxr-x   2 hadoop supergroup    1493856 2016-07-22 14:02 /user/hive/warehouse/bucketed_data_user/000001_0
-rwxrwxr-x   2 hadoop supergroup    1566738 2016-07-22 14:02 /user/hive/warehouse/bucketed_data_user/000002_0
-rwxrwxr-x   2 hadoop supergroup    1475931 2016-07-22 14:02 /user/hive/warehouse/bucketed_data_user/000003_0
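Each of the four files 000000_0 through 000003_0 holds one bucket. To my understanding (treat this as an assumption, not something shown in the transcript), Hive's default hash of an INT clustering column is simply the integer value itself, so the bucket assignment reduces to userid modulo 4:

```python
NUM_BUCKETS = 4  # matches CLUSTERED BY (userid) INTO 4 BUCKETS above

def bucket_for(userid: int) -> int:
    # Assumption: Hive hashes an INT column as its own value, so the
    # target bucket file (000000_0 .. 000003_0) is userid % NUM_BUCKETS.
    return userid % NUM_BUCKETS

# userids from the sample rows earlier in the article
print(bucket_for(196), bucket_for(186), bucket_for(22))  # -> 0 2 2
```

This also explains why the four files end up roughly the same size: user ids are spread fairly evenly across the residues mod 4.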
