Comparing Row-Oriented and Columnar Storage in Big Data

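Compression ratio and compression speed trade off against each other:

Compression ratio: bzip2 > gzip > lzo > snappy. Compression speed: snappy > lzo > gzip > bzip2.

Compression and decompression are CPU-intensive, so compression should be avoided when the machines are already under heavy load. A shortage of resources can be resolved quickly by scaling out the cluster.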
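The ratio-versus-speed trade-off described above can be observed on a small scale with Python's standard library. The following is a minimal sketch, not a benchmark: it covers only zlib (the DEFLATE algorithm behind gzip) and bzip2, because lzo and snappy are not in the standard library, and the helper names `measure` and `make_sample` as well as the sample data are hypothetical, made up for illustration.

```python
import bz2
import time
import zlib


def measure(name, compress, decompress, data):
    """Compress `data` once and report the compression ratio and elapsed seconds."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    # Round-trip check: decompression must reproduce the input exactly.
    assert decompress(packed) == data
    return {"codec": name, "ratio": len(data) / len(packed), "seconds": elapsed}


def make_sample(rows=20000):
    """Build hypothetical tab-separated log lines (assumed data, for illustration only)."""
    lines = [
        f"{i}\thttp://example.com/page/{i % 97}\t{i % 13}\t2020-01-01 00:00:{i % 60:02d}"
        for i in range(rows)
    ]
    return "\n".join(lines).encode("utf-8")


if __name__ == "__main__":
    data = make_sample()
    results = [
        measure("zlib (DEFLATE, as gzip)", zlib.compress, zlib.decompress, data),
        measure("bzip2", bz2.compress, bz2.decompress, data),
    ]
    for r in results:
        print(f"{r['codec']:<24} ratio={r['ratio']:.2f} seconds={r['seconds']:.4f}")
```

The numbers depend heavily on the data and the machine, so treat the output as a rough indication of the ordering rather than a measurement that carries over to a Hive cluster.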

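CREATE TABLE statements in Hive (columnar format + compression):

(1) ORC format

-- ZLIB is the default compression for ORC in Hive, so specifying it explicitly gives the same result as omitting it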

create table page_views_orc_zlib
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
STORED AS ORC 
TBLPROPERTIES("orc.compress"="ZLIB")
as select * from page_views;

create table page_views_orc_snappy
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
STORED AS ORC 
TBLPROPERTIES("orc.compress"="SNAPPY")
as select * from page_views;

(2) Parquet format

create table page_views_parquet
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
STORED AS PARQUET 
as select * from page_views;


set parquet.compression=gzip;
create table page_views_parquet_gzip
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
STORED AS PARQUET 
as select * from page_views;
 
