Bucketing and Sampling Queries

1. Bucketed Table Data Storage

Partitioning targets the data's storage path; bucketing targets the data files themselves.

Partitioning provides a convenient way to isolate data and optimize queries. However, not every dataset can be partitioned sensibly; in particular, choosing an appropriate partition granularity, as discussed earlier, can be difficult.

Bucketing is another technique for decomposing a dataset into more manageable parts.
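The idea can be sketched in Python: Hive assigns each row to a bucket by hashing the bucketing column and taking the result modulo the bucket count. This is a minimal sketch, assuming that for an INT column the hash is simply the value itself and that the index is computed as `(hash & Integer.MAX_VALUE) % numBuckets`; the function name `bucket_of` is ours, not Hive's.

```python
# Sketch of Hive's bucket assignment (assumption: for an INT bucketing
# column the hash equals the value; the bucket index is then
# (hash & Integer.MAX_VALUE) % numBuckets, so it is never negative).
INT_MAX = 2**31 - 1

def bucket_of(id_value: int, num_buckets: int = 4) -> int:
    return (id_value & INT_MAX) % num_buckets

# Rows with ids 1..8 spread across the 4 buckets of a table like stu_buck:
assignment = {i: bucket_of(i) for i in range(1, 9)}
print(assignment)  # {1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3, 8: 0}
```

Each bucket becomes its own file under the table's HDFS directory, which is why a correctly bucketed `stu_buck` should show four files.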

hive (default)> show databases;

OK
database_name
default
Time taken: 1.092 seconds, Fetched: 1 row(s)
hive (default)> create table stu_buck(id int, name string)
              > clustered by(id) 
              > into 4 buckets
              > row format delimited fields terminated by '\t';

OK
Time taken: 0.443 seconds
hive (default)> desc formatted stu_buck;
OK
col_name    data_type    comment
# col_name                data_type               comment             
          
id                      int                                         
name                    string                                      
          
# Detailed Table Information          
Database:               default                  
Owner:                  root                     
CreateTime:             Sun Nov 03 03:49:59 CST 2019     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://mycluster/user/hive/warehouse/stu_buck     
Table Type:             MANAGED_TABLE            
Table Parameters:          
    transient_lastDdlTime    1572724199          
          
# Storage Information          
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe     
InputFormat:            org.apache.hadoop.mapred.TextInputFormat     
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat     
Compressed:             No                       
Num Buckets:            4                        
Bucket Columns:         [id]                     
Sort Columns:           []                       
Storage Desc Params:          
    field.delim             \t                  
    serialization.format    \t                  
Time taken: 0.318 seconds, Fetched: 28 row(s)
hive (default)> load data local inpath "/root/student" into table stu_buck;
Loading data to table default.stu_buck
Table default.stu_buck stats: [numFiles=1, totalSize=54]
OK
Time taken: 1.82 seconds
hive (default)> create table stu(id int, name string)
              > row format delimited fields terminated by '\t';

OK
Time taken: 0.123 seconds
hive (default)> load data local inpath "/root/student" into table stu;
Loading data to table default.stu
Table default.stu stats: [numFiles=1, totalSize=54]
OK
Time taken: 0.742 seconds
hive (default)> truncate table stu_buck;
OK
Time taken: 0.276 seconds
hive (default)> select * from stu_buck;
OK
stu_buck.id    stu_buck.name
Time taken: 0.453 seconds
hive (default)> insert into table stu_buck select id,name from stu;
Query ID = root_20191103035354_d7ff026a-7592-48a1-ba5c-f0d2cced5d46
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1572895740509_0001, Tracking URL = http://henu3:8088/proxy/application_1572895740509_0001/
Kill Command = /opt/hadoop-2.6.5/bin/hadoop job  -kill job_1572895740509_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-11-03 03:54:21,088 Stage-1 map = 0%,  reduce = 0%
2019-11-03 03:54:42,550 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.63 sec
MapReduce Total cumulative CPU time: 1 seconds 630 msec
Ended Job = job_1572895740509_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://mycluster/user/hive/warehouse/stu_buck/.hive-staging_hive_2019-11-03_03-53-54_155_2767196214915907307-1/-ext-10000
Loading data to table default.stu_buck
Table default.stu_buck stats: [numFiles=1, numRows=7, totalSize=42, rawDataSize=35]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.63 sec   HDFS Read: 3233 HDFS Write: 114 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 630 msec
OK
id    name
Time taken: 51.336 seconds

The data was NOT bucketed! The job ran with zero reducers and produced a single file (numFiles=1) instead of four.

hive (default)> set hive.enforce.bucketing=true;    -- the key setting
hive (default)> set mapreduce.job.reduces=-1;
hive (default)> insert into table stu_buck
              > select id, name from stu;

Query ID = root_20191103035755_addfbe62-a68d-4e56-8f8b-bdc3fb44153b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1572895740509_0002, Tracking URL = http://henu3:8088/proxy/application_1572895740509_0002/
Kill Command = /opt/hadoop-2.6.5/bin/hadoop job  -kill job_1572895740509_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2019-11-03 03:58:21,998 Stage-1 map = 0%,  reduce = 0%
2019-11-03 03:58:31,546 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.08 sec
2019-11-03 03:58:49,596 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU 4.45 sec
2019-11-03 03:58:51,903 Stage-1 map = 100%,  reduce = 25%, Cumulative CPU 5.71 sec
2019-11-03 03:58:53,050 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 9.12 sec
2019-11-03 03:58:54,155 Stage-1 map = 100%,  reduce = 75%, Cumulative CPU 12.65 sec
2019-11-03 03:58:56,572 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 15.17 sec
MapReduce Total cumulative CPU time: 15 seconds 170 msec
Ended Job = job_1572895740509_0002
Loading data to table default.stu_buck
Table default.stu_buck stats: [numFiles=5, numRows=14, totalSize=84, rawDataSize=70]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 4   Cumulative CPU: 15.17 sec   HDFS Read: 15296 HDFS Write: 240 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 170 msec
OK
id    name
Time taken: 64.242 seconds


2. Bucket Sampling Queries

For very large datasets, users sometimes need a representative sample of the results rather than the full result set. Hive can satisfy this need by sampling a table.

Query the data in table stu_buck:

hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);

Note: tablesample is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y).

y must be a multiple or a factor of the table's total bucket count. Hive determines the sampling proportion from y. For example, if the table has 4 buckets in total: when y=2, data from (4/2=) 2 buckets is sampled; when y=8, data from (4/8=) 1/2 of a bucket is sampled.

x indicates which bucket to start sampling from. When multiple buckets are sampled, each subsequent bucket number is the previous one plus y. For example, with 4 total buckets, tablesample(bucket 1 out of 2) samples (4/2=) 2 buckets of data: the 1st (x) and the 3rd (x+y).
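The bucket-selection rule above can be computed directly. This is a minimal sketch, assuming y is a factor of the table's total bucket count; the helper name `sampled_buckets` is ours, not part of Hive.

```python
# Hypothetical helper: which buckets does TABLESAMPLE(BUCKET x OUT OF y)
# read from a table with `total_buckets` buckets? (1-based bucket numbers)
def sampled_buckets(x: int, y: int, total_buckets: int) -> list[int]:
    if x > y:
        # Mirrors Hive's SemanticException [Error 10061]
        raise ValueError("Numerator should not be bigger than denominator")
    # Buckets x, x+y, x+2y, ... up to the total bucket count.
    return list(range(x, total_buckets + 1, y))

print(sampled_buckets(1, 2, 4))  # [1, 3]
print(sampled_buckets(1, 4, 4))  # [1]
```

So for the query `tablesample(bucket 1 out of 4 on id)` against the 4-bucket table stu_buck, only the first bucket is read.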

Note: the value of x must be less than or equal to y; otherwise:

FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck

 
