Bucketing and Sampling Queries

1. Bucketed Table Data Storage

Partitioning targets the data's storage path; bucketing targets the data files themselves.

Partitioning provides a convenient way to isolate data and optimize queries. However, not every dataset can be partitioned sensibly; in particular, choosing an appropriate partition granularity, as discussed earlier, can be difficult.

Bucketing is another technique for decomposing a dataset into more manageable parts.
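The idea can be sketched in Python: Hive assigns each row to a bucket by hashing the bucketing column and taking the result modulo the bucket count. This is a minimal sketch, assuming that for an INT column the hash is simply the value itself and that the index is computed as `(hash & Integer.MAX_VALUE) % numBuckets`; the function name `bucket_of` is ours, not Hive's.

```python
# Sketch of Hive's bucket assignment (assumption: for an INT bucketing
# column the hash equals the value; the bucket index is then
# (hash & Integer.MAX_VALUE) % numBuckets, so it is never negative).
INT_MAX = 2**31 - 1

def bucket_of(id_value: int, num_buckets: int = 4) -> int:
    return (id_value & INT_MAX) % num_buckets

# Rows with ids 1..8 spread across the 4 buckets of a table like stu_buck:
assignment = {i: bucket_of(i) for i in range(1, 9)}
print(assignment)  # {1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3, 8: 0}
```

Each bucket becomes its own file under the table's HDFS directory, which is why a correctly bucketed `stu_buck` should show four files.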

hive (default)> show databases;

OK
database_name
default
Time taken: 1.092 seconds, Fetched: 1 row(s)
hive (default)> create table stu_buck(id int, name string)
              > clustered by(id) 
              > into 4 buckets
              > row format delimited fields terminated by '\t';

OK
Time taken: 0.443 seconds
hive (default)> desc formatted stu_buck;
OK
col_name    data_type    comment
# col_name                data_type               comment             
          
id                      int                                         
name                    string                                      
          
# Detailed Table Information          
Database:               default                  
Owner:                  root                     
CreateTime:             Sun Nov 03 03:49:59 CST 2019     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://mycluster/user/hive/warehouse/stu_buck     
Table Type:             MANAGED_TABLE            
Table Parameters:          
    transient_lastDdlTime    1572724199          
          
# Storage Information          
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe     
InputFormat:            org.apache.hadoop.mapred.TextInputFormat     
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat     
Compressed:             No                       
Num Buckets:            4                        
Bucket Columns:         [id]                     
Sort Columns:           []                       
Storage Desc Params:          
    field.delim             \t                  
    serialization.format    \t                  
Time taken: 0.318 seconds, Fetched: 28 row(s)
hive (default)> load data local inpath "/root/student" into table stu_buck;
Loading data to table default.stu_buck
Table default.stu_buck stats: [numFiles=1, totalSize=54]
OK
Time taken: 1.82 seconds
hive (default)> create table stu(id int, name string)
              > row format delimited fields terminated by '\t';

OK
Time taken: 0.123 seconds
hive (default)> load data local inpath "/root/student" into table stu;
Loading data to table default.stu
Table default.stu stats: [numFiles=1, totalSize=54]
OK
Time taken: 0.742 seconds
hive (default)> truncate table stu_buck;
OK
Time taken: 0.276 seconds
hive (default)> select * from stu_buck;
OK
stu_buck.id    stu_buck.name
Time taken: 0.453 seconds
hive (default)> insert into table stu_buck select id,name from stu;
Query ID = root_20191103035354_d7ff026a-7592-48a1-ba5c-f0d2cced5d46
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1572895740509_0001, Tracking URL = http://henu3:8088/proxy/application_1572895740509_0001/
Kill Command = /opt/hadoop-2.6.5/bin/hadoop job  -kill job_1572895740509_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-11-03 03:54:21,088 Stage-1 map = 0%,  reduce = 0%
2019-11-03 03:54:42,550 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.63 sec
MapReduce Total cumulative CPU time: 1 seconds 630 msec
Ended Job = job_1572895740509_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://mycluster/user/hive/warehouse/stu_buck/.hive-staging_hive_2019-11-03_03-53-54_155_2767196214915907307-1/-ext-10000
Loading data to table default.stu_buck
Table default.stu_buck stats: [numFiles=1, numRows=7, totalSize=42, rawDataSize=35]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.63 sec   HDFS Read: 3233 HDFS Write: 114 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 630 msec
OK
id    name
Time taken: 51.336 seconds

The data was NOT bucketed! The job ran with zero reducers and produced a single file (numFiles=1) instead of four.

hive (default)> set hive.enforce.bucketing=true;    -- the key setting
hive (default)> set mapreduce.job.reduces=-1;
hive (default)> insert into table stu_buck
              > select id, name from stu;

Query ID = root_20191103035755_addfbe62-a68d-4e56-8f8b-bdc3fb44153b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1572895740509_0002, Tracking URL = http://henu3:8088/proxy/application_1572895740509_0002/
Kill Command = /opt/hadoop-2.6.5/bin/hadoop job  -kill job_1572895740509_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2019-11-03 03:58:21,998 Stage-1 map = 0%,  reduce = 0%
2019-11-03 03:58:31,546 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.08 sec
2019-11-03 03:58:49,596 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU 4.45 sec
2019-11-03 03:58:51,903 Stage-1 map = 100%,  reduce = 25%, Cumulative CPU 5.71 sec
2019-11-03 03:58:53,050 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 9.12 sec
2019-11-03 03:58:54,155 Stage-1 map = 100%,  reduce = 75%, Cumulative CPU 12.65 sec
2019-11-03 03:58:56,572 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 15.17 sec
MapReduce Total cumulative CPU time: 15 seconds 170 msec
Ended Job = job_1572895740509_0002
Loading data to table default.stu_buck
Table default.stu_buck stats: [numFiles=5, numRows=14, totalSize=84, rawDataSize=70]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 4   Cumulative CPU: 15.17 sec   HDFS Read: 15296 HDFS Write: 240 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 170 msec
OK
id    name
Time taken: 64.242 seconds


2. Bucket Sampling Queries

For very large datasets, users sometimes need a representative sample of the results rather than the full result set. Hive can satisfy this need by sampling a table.

Query the data in table stu_buck:

hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);

Note: tablesample is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y).

y must be a multiple or a factor of the table's total bucket count. Hive determines the sampling proportion from y. For example, if the table has 4 buckets in total: when y=2, data from (4/2=) 2 buckets is sampled; when y=8, data from (4/8=) 1/2 of a bucket is sampled.

x indicates which bucket to start sampling from. When multiple buckets are sampled, each subsequent bucket number is the previous one plus y. For example, with 4 total buckets, tablesample(bucket 1 out of 2) samples (4/2=) 2 buckets of data: the 1st (x) and the 3rd (x+y).
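The bucket-selection rule above can be computed directly. This is a minimal sketch, assuming y is a factor of the table's total bucket count; the helper name `sampled_buckets` is ours, not part of Hive.

```python
# Hypothetical helper: which buckets does TABLESAMPLE(BUCKET x OUT OF y)
# read from a table with `total_buckets` buckets? (1-based bucket numbers)
def sampled_buckets(x: int, y: int, total_buckets: int) -> list[int]:
    if x > y:
        # Mirrors Hive's SemanticException [Error 10061]
        raise ValueError("Numerator should not be bigger than denominator")
    # Buckets x, x+y, x+2y, ... up to the total bucket count.
    return list(range(x, total_buckets + 1, y))

print(sampled_buckets(1, 2, 4))  # [1, 3]
print(sampled_buckets(1, 4, 4))  # [1]
```

So for the query `tablesample(bucket 1 out of 4 on id)` against the 4-bucket table stu_buck, only the first bucket is read.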

Note: the value of x must be less than or equal to y; otherwise:

FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck

 
