HBase Scan and Get Usage


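1. get help output

The help text below shows that get takes a table name and a rowkey, optionally followed by columns and options such as TIMESTAMP, TIMERANGE, VERSIONS, or FILTER. get always addresses a single row, so it cannot return the full contents of a table; that is what scan is for.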

hbase(main):214:0> help 'get'
Get row or cell contents; pass table name, row, and optionally
a dictionary of column(s), timestamp, timerange and versions. Examples:

  hbase> get 'ns1:t1', 'r1'
  hbase> get 't1', 'r1'
  hbase> get 't1', 'r1', {TIMERANGE => [ts1, ts2]}
  hbase> get 't1', 'r1', {COLUMN => 'c1'}
  hbase> get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
  hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
  hbase> get 't1', 'r1', {COLUMN => 'c1', TIMERANGE => [ts1, ts2], VERSIONS => 4}
  hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}
  hbase> get 't1', 'r1', {FILTER => "ValueFilter(=, 'binary:abc')"}
  hbase> get 't1', 'r1', 'c1'
  hbase> get 't1', 'r1', 'c1', 'c2'
  hbase> get 't1', 'r1', ['c1', 'c2']
  hbase> get 't1','r1', {COLUMN => 'c1', ATTRIBUTES => {'mykey'=>'myvalue'}}
  hbase> get 't1','r1', {COLUMN => 'c1', AUTHORIZATIONS => ['PRIVATE','SECRET']}

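2. scan help output

scan has many forms: it can read an entire table, or return only the data that matches the conditions you specify. Several of those conditions rely on filters, which the later sections demonstrate one at a time.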

hbase(main):221:0> help 'scan'
Scan a table; pass table name and optionally a dictionary of scanner
specifications.  Scanner specifications may include one or more of:
TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, TIMESTAMP, MAXLENGTH,
or COLUMNS, CACHE

If no columns are specified, all columns will be scanned.
To scan all members of a column family, leave the qualifier empty as in
'col_family:'.

The filter can be specified in two ways:
1. Using a filterString - more information on this is available in the
Filter Language document attached to the HBASE-4176 JIRA
2. Using the entire package name of the filter.

Some examples:

  hbase> scan 'hbase:meta'
  hbase> scan 'hbase:meta', {COLUMNS => 'info:regioninfo'}
  hbase> scan 'ns1:t1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
  hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
  hbase> scan 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}
  hbase> scan 't1', {REVERSED => true}
  hbase> scan 't1', {FILTER => "(PrefixFilter ('row2') AND
    (QualifierFilter (>=, 'binary:xyz'))) AND (TimestampsFilter ( 123, 456))"}
  hbase> scan 't1', {FILTER =>
    org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}
For setting the Operation Attributes
  hbase> scan 't1', { COLUMNS => ['c1', 'c2'], ATTRIBUTES => {'mykey' => 'myvalue'}}
  hbase> scan 't1', { COLUMNS => ['c1', 'c2'], AUTHORIZATIONS => ['PRIVATE','SECRET']}
For experts, there is an additional option -- CACHE_BLOCKS -- which
switches block caching for the scanner on (true) or off (false).  By
default it is enabled.  Examples:

  hbase> scan 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}

Also for experts, there is an advanced option -- RAW -- which instructs the
scanner to return all cells (including delete markers and uncollected deleted
cells). This option cannot be combined with requesting specific COLUMNS.
Disabled by default.  Example:

  hbase> scan 't1', {RAW => true, VERSIONS => 10}

Besides the default 'toStringBinary' format, 'scan' supports custom formatting
by column.  A user can define a FORMATTER by adding it to the column name in
the scan specification.  The FORMATTER can be stipulated:

 1. either as a org.apache.hadoop.hbase.util.Bytes method name (e.g, toInt, toString)
 2. or as a custom class followed by method name: e.g. 'c(MyFormatterClass).format'.

Example formatting cf:qualifier1 and cf:qualifier2 both as Integers:
  hbase> scan 't1', {COLUMNS => ['cf:qualifier1:toInt',
    'cf:qualifier2:c(org.apache.hadoop.hbase.util.Bytes).toInt'] }

Note that you can specify a FORMATTER by column only (cf:qualifer).  You cannot
specify a FORMATTER for all columns of a column family.

Scan can also be used directly from a table, by first getting a reference to a
table, like such:

  hbase> t = get_table 't'
  hbase> t.scan

Note in the above situation, you can still provide all the filtering, columns,
options, etc as described above.

3. Fetching a specific rowkey with get and with scan

1. The get statement versus the scan statement for fetching one rowkey
=================================================================================================================
【get】
hbase(main):011:0> get 'liupeng:employee','1001'
COLUMN                                  CELL
 contect:mail                           timestamp=1522202414649, value=liupliup@cn.ibm.com
 contect:phone                          timestamp=1522202430196, value=15962459503
 group:number                           timestamp=1522202455929, value=1
 info:age                               timestamp=1522202371257, value=34
 info:name                              timestamp=1522202364156, value=liupeng

【Scan】
hbase(main):010:0> scan 'liupeng:employee',FILTER=>"PrefixFilter('1001')"
ROW                                     COLUMN+CELL
 1001                                   column=contect:mail, timestamp=1522202414649, value=liupliup@cn.ibm.com
 1001                                   column=contect:phone, timestamp=1522202430196, value=15962459503
 1001                                   column=group:number, timestamp=1522202455929, value=1
 1001                                   column=info:age, timestamp=1522202371257, value=34
 1001                                   column=info:name, timestamp=1522202364156, value=liupeng
1 row(s) in 0.0590 seconds

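Summary: scan output includes the rowkey on every line, while get output does not. get prints the column and the cell in separate COLUMN and CELL fields; scan merges them into a single COLUMN+CELL field.
In the scan FILTER, PrefixFilter selects rows by rowkey. It is a prefix match rather than an exact match, so PrefixFilter('1001') would also return a rowkey such as 10010 if one existed.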
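
To make the difference between an exact get and a prefix scan concrete, here is a minimal Python sketch over a toy in-memory table. It is only a model of the behaviour described above, not the HBase client API; the functions `get` and `scan_prefix` and the row `10010` are assumptions made up for illustration.

```python
# Toy stand-in for a table: {rowkey: {"family:qualifier": value}}.
# This models the semantics only; it does not talk to HBase.
TABLE = {
    "1001": {"info:name": "liupeng", "info:age": "34"},
    "1002": {"info:name": "Jack_Ma"},
    "10010": {"info:name": "hypothetical_row"},  # made-up row, not in the real table
}


def get(table, rowkey):
    """Exact lookup: return the cells of one row, keyed by column (no rowkey in the result)."""
    return dict(table.get(rowkey, {}))


def scan_prefix(table, prefix):
    """Prefix scan: return (rowkey, column, value) for rows whose key starts with prefix."""
    return [
        (rowkey, column, value)
        for rowkey in sorted(table)  # rows come back in sorted rowkey order
        if rowkey.startswith(prefix)
        for column, value in sorted(table[rowkey].items())
    ]


get_result = get(TABLE, "1001")
scan_result = scan_prefix(TABLE, "1001")

print(get_result)
for cell in scan_result:
    print(cell)
```

In this model the get result carries only columns and values, while each scan result carries its rowkey, and the prefix scan also picks up the made-up row 10010.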

4. Differences between get and scan when fetching a single cell
《Similarities》

hbase(main):229:0> get "liupeng:employee",'1001','info:phone'
COLUMN                          CELL                                                                                     
 info:phone                     timestamp=1527914569028, value=15962459503                                               
1 row(s) in 0.0320 seconds

hbase(main):230:0> scan "liupeng:employee",FILTER=>"PrefixFilter('1001')AND ValueFilter(=,'substring:159')"
ROW                             COLUMN+CELL                                                                              
 1001                           column=info:phone, timestamp=1527914569028, value=15962459503                            
1 row(s) in 0.1010 seconds

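《Differences》
##Note: both statements above return the cell in row 1001 whose value contains '159', but they find it in very different ways. get names the exact column, 'info:phone'.
scan fixes the rowkey with PrefixFilter, then uses AND to add a ValueFilter that does a substring match on 159. If you do not know in advance that the value is stored in info:phone,
the substring match that scan offers is clearly the more convenient approach.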

Moreover, a scan like the following has no get equivalent. For example:
=================================================================================================================

hbase(main):026:0> scan 'liupeng:employee',FILTER=>"ValueFilter(=,'substring:159')"
ROW                                     COLUMN+CELL
 1001                                   column=contect:phone, timestamp=1522202430196, value=15962459503
 1002                                   column=contect:phone, timestamp=1522202527866, value=15977634464

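##Explanation: the scan above uses a substring match to return every cell whose value contains 159, in any row. get, as the syntax below shows, needs a rowkey before it can look up columns,
which only works well if the rowkey design anticipated the lookups you need. In this respect I find scan more convenient than get for retrieving data.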

  hbase> t.get 'r1'
  hbase> t.get 'r1', {TIMERANGE => [ts1, ts2]}
  hbase> t.get 'r1', {COLUMN => 'c1'}
  hbase> t.get 'r1', {COLUMN => ['c1', 'c2', 'c3']}
  hbase> t.get 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
  hbase> t.get 'r1', {COLUMN => 'c1', TIMERANGE => [ts1, ts2], VERSIONS => 4}
  hbase> t.get 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}
  hbase> t.get 'r1', {FILTER => "ValueFilter(=, 'binary:abc')"}
  hbase> t.get 'r1', 'c1'
  hbase> t.get 'r1', 'c1', 'c2'
  hbase> t.get 'r1', ['c1', 'c2']
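
The cell-level substring match that ValueFilter performs in the scan above can be modelled in a few lines of Python. This is a rough sketch over a toy in-memory table that reuses values from the transcripts; `scan_value_substring` is a hypothetical helper, not part of any HBase API.

```python
# Toy stand-in for the table: {rowkey: {"family:qualifier": value}}.
# Values are copied from the shell output in this post.
TABLE = {
    "1001": {"contect:phone": "15962459503", "info:name": "liupeng"},
    "1002": {"contect:phone": "15977634464", "info:name": "Jack_Ma"},
    "1003": {"contect:phone": "18665851263", "info:name": "kevin_shi"},
}


def scan_value_substring(table, needle):
    """Return (rowkey, column, value) for every cell whose value contains needle.

    No rowkey or column is required up front: every cell in every row is tested.
    """
    return [
        (rowkey, column, value)
        for rowkey in sorted(table)
        for column, value in sorted(table[rowkey].items())
        if needle in value
    ]


matches = scan_value_substring(TABLE, "159")
for rowkey, column, value in matches:
    print(rowkey, column, value)
```

Because the test runs against each cell independently, only the matching cells come back, not the rest of their rows.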

5. scan can search by column or value without specifying a rowkey; put more concretely, it can look for a particular value in a particular column. get cannot do this.

ColumnPrefixFilter('column qualifier prefix')

hbase(main):038:0> scan 'liupeng:employee',FILTER=>"ColumnPrefixFilter('name')"
ROW                                     COLUMN+CELL
 1001                                   column=info:name, timestamp=1522202364156, value=liupeng
 1002                                   column=info:name, timestamp=1522202474669, value=Jack_Ma
 1003                                   column=info:name, timestamp=1522202561029, value=kevin_shi
3 row(s) in 0.0210 seconds

##Note: ColumnPrefixFilter selects cells whose column qualifier starts with the given prefix. Here it matches the name qualifier in the info column family.

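6. What makes scan convenient is that it can match on any combination of rowkey, column, and value, and conditions can be combined with operators such as AND and OR to find exactly the data you want.
The statement below uses AND and OR together. AND binds more tightly than OR, so it reads as (ColumnPrefixFilter AND ValueFilter) OR ValueFilter; add parentheses to group differently. The same operator can also appear more than once in one statement, as the AND ... AND example in the scan help above shows.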

hbase(main):060:0> scan 'liupeng:employee',FILTER=>"ColumnPrefixFilter('ph')AND ValueFilter(=,'substring:15962')OR ValueFilter(=,'substring:186')"
ROW                                                  COLUMN+CELL
 1001                                                column=contect:phone, timestamp=1522202430196, value=15962459503
 1003                                                column=contect:phone, timestamp=1522202605976, value=18665851263
2 row(s) in 0.0170 seconds
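
How the AND and OR in the statement above group matters once the table holds other cells containing 186. The Python sketch below treats each filter as a simple per-cell predicate, which is a simplification of how HBase evaluates filter lists, and compares the two possible groupings. The helper names and the cell in row 1004 are assumptions introduced only for this illustration.

```python
# Cells as (rowkey, "family:qualifier", value); a simplified per-cell model, not the HBase API.
CELLS = [
    ("1001", "contect:phone", "15962459503"),
    ("1001", "info:name", "liupeng"),
    ("1002", "contect:phone", "15977634464"),
    ("1003", "contect:phone", "18665851263"),
    ("1004", "info:note", "ext 186"),  # hypothetical cell, not in the original table
]


def column_prefix(prefix):
    """Predicate: the qualifier (the part after 'family:') starts with prefix."""
    return lambda cell: cell[1].split(":", 1)[1].startswith(prefix)


def value_substring(needle):
    """Predicate: the cell value contains needle."""
    return lambda cell: needle in cell[2]


def and_(a, b):
    return lambda cell: a(cell) and b(cell)


def or_(a, b):
    return lambda cell: a(cell) or b(cell)


def scan(cells, predicate):
    return [cell for cell in cells if predicate(cell)]


ph = column_prefix("ph")
v15962 = value_substring("15962")
v186 = value_substring("186")

# "A AND B OR C" with AND binding tighter: (A AND B) OR C.
and_first = scan(CELLS, or_(and_(ph, v15962), v186))
# Explicit parentheses: A AND (B OR C).
or_grouped = scan(CELLS, and_(ph, or_(v15962, v186)))

print([cell[0] for cell in and_first])
print([cell[0] for cell in or_grouped])
```

In this model the unparenthesised form also returns the made-up info:note cell, because the trailing OR branch is not restricted to columns starting with ph; the parenthesised form keeps the column restriction on both value tests.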

7. Using SingleColumnValueFilter to name a column and a value to match, and list every column and value of the matching rows

hbase(main):242:0> scan "liupeng:employee",{FILTER=>"SingleColumnValueFilter('info','age',=,'substring:30')"}
ROW                             COLUMN+CELL                                                                              
 1005                           column=contect:mail, timestamp=1528420218800, value=zhangsan@163.com                     
 1005                           column=info:age, timestamp=1528439967493, value=30                                       
 1005                           column=info:name, timestamp=1528420218800, value=zhangsan                                
 1008                           column=contect:mail, timestamp=1528681786126, value=www.kevin@alibaba.com                
 1008                           column=info:age, timestamp=1528681786126, value=30                                       
 1008                           column=info:name, timestamp=1528681786126, value=kevin                                   
2 row(s) in 0.0110 seconds

8. SingleColumnValueFilter also supports regular expressions, so a pattern match on one column can return the rowkeys, columns, and values of the matching rows.

hbase(main):244:0> scan "liupeng:employee",{FILTER=>"SingleColumnValueFilter('info','name',=,'regexstring:liu')"}
ROW                             COLUMN+CELL                                                                              
 1001                           column=contect:mail, timestamp=1527231141046, value=liupliup@cn.ibm.com                  
 1001                           column=info:address, timestamp=1527753987327, value=shanghai                             
 1001                           column=info:age, timestamp=1527231097033, value=34                                       
 1001                           column=info:name, timestamp=1527231081262, value=liupeng                                 
 1001                           column=info:phone, timestamp=1527914569028, value=15962459503                            
 1004                           column=contect:mail, timestamp=1527473497956, value=lqdong@jingdong.com                  
 1004                           column=info:address, timestamp=1527755135174, value=shenzhen                             
 1004                           column=info:age, timestamp=1527473477124, value=40                                       
 1004                           column=info:name, timestamp=1527415665182, value=liuqiangdong                            
2 row(s) in 0.0080 seconds
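
Unlike ValueFilter, which keeps or drops individual cells, SingleColumnValueFilter tests one column and then keeps or drops the whole row. The Python sketch below models that row-level behaviour with a regular expression, including the default of letting through rows that lack the tested column. It is a toy model rather than the HBase API; the function name, the `filter_if_missing` parameter name, and row 1009 are assumptions for illustration.

```python
import re

# Toy stand-in for the table: {rowkey: {"family:qualifier": value}}.
TABLE = {
    "1001": {"info:name": "liupeng", "info:age": "34"},
    "1002": {"info:name": "Jack_Ma"},
    "1004": {"info:name": "liuqiangdong", "info:age": "40"},
    "1009": {"info:age": "22"},  # hypothetical row with no info:name column
}


def single_column_value_regex(table, column, pattern, filter_if_missing=False):
    """Return whole rows whose `column` value matches `pattern` anywhere in the string.

    Rows that do not have `column` are kept unless filter_if_missing is True.
    """
    rows = {}
    for rowkey, cells in table.items():
        if column not in cells:
            if not filter_if_missing:
                rows[rowkey] = dict(cells)
            continue
        if re.search(pattern, cells[column]):  # partial match, like 'regexstring:liu'
            rows[rowkey] = dict(cells)
    return rows


default_rows = single_column_value_regex(TABLE, "info:name", "liu")
strict_rows = single_column_value_regex(TABLE, "info:name", "liu", filter_if_missing=True)

print(sorted(default_rows))
print(sorted(strict_rows))
```

Each returned row keeps all of its columns, which matches the shell output above where info:age and contect:mail appear alongside the matched info:name.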