Scala/Spark in Practice, Part 3

Optimizing Spark Reads and Writes to HBase

Reading data

HBase data can be read as an RDD:

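import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, hTabName) // name of the table to scan
val rdd = sparkContext.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], // key: row key wrapper
  classOf[org.apache.hadoop.hbase.client.Result]              // value: one row per Result
)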
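
Since a full-table scan is often the main cost when reading, one common tuning step is to narrow the scan through the configuration keys that TableInputFormat understands. The following is a minimal sketch, not a definitive implementation: the column "cf:col", the caching value 500, and the helper names scanConf and readColumn are assumptions made for illustration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Build a scan configuration limited to one column and a row-key range.
def scanConf(hTabName: String, startRow: String, stopRow: String): Configuration = {
  val conf = HBaseConfiguration.create()
  conf.set(TableInputFormat.INPUT_TABLE, hTabName)
  conf.set(TableInputFormat.SCAN_COLUMNS, "cf:col")   // hypothetical family:qualifier
  conf.set(TableInputFormat.SCAN_ROW_START, startRow) // inclusive start row
  conf.set(TableInputFormat.SCAN_ROW_STOP, stopRow)   // exclusive stop row
  conf.set(TableInputFormat.SCAN_CACHEDROWS, "500")   // rows per scanner RPC; tune as needed
  conf
}

// Read the table and decode each Result into (rowKey, value) strings.
// The value is null when the row has no cell in the hypothetical column cf:col.
def readColumn(sc: SparkContext, conf: Configuration): RDD[(String, String)] =
  sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    .map { case (_, result) =>
      val value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))
      (Bytes.toString(result.getRow), Bytes.toString(value))
    }
```

Decoding into plain strings right after the read also avoids carrying ImmutableBytesWritable and Result, which are not Java-serializable, into later shuffle stages.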

Writing data

Data can be written in batch through TableOutputFormat:

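import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
val conf = HBaseConfiguration.create()
conf.set(TableOutputFormat.OUTPUT_TABLE, hTabName) // name of the output table
val job = Job.getInstance(conf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]]) // required by saveAsNewAPIHadoopDataset
rdd.map { row =>
    val put = new Put(Bytes.toBytes("row key"))
    ...
    (new ImmutableBytesWritable, put)           // (key, Put) pair expected by TableOutputFormat
}.saveAsNewAPIHadoopDataset(job.getConfiguration)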
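
The body of the map step is elided above. As a rough illustration of what such a conversion might look like, the sketch below turns a record into the (ImmutableBytesWritable, Put) pair. The record type UserRow, the column family "cf", and the qualifier "name" are hypothetical and are not taken from the original code.

```scala
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical record type, used only for this illustration.
final case class UserRow(id: String, name: String)

// Convert one record into the key/value pair that TableOutputFormat expects.
def toPut(row: UserRow): (ImmutableBytesWritable, Put) = {
  val rowKey = Bytes.toBytes(row.id)
  val put = new Put(rowKey)
  // One cell: family "cf", qualifier "name" (both assumed for the example).
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes(row.name))
  (new ImmutableBytesWritable(rowKey), put)
}
```

With an RDD[UserRow], rdd.map(toPut) would take the place of the map step before saveAsNewAPIHadoopDataset. Keeping the conversion in a named function also makes it easy to check without a cluster.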

Personal take:

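With the RDD approach, TableInputFormat by default produces one input split per HBase region, so each Spark partition scans its own region and the data is pulled in parallel. Connections are not created on the driver and broadcast; each task opens its own connection to HBase on the executor.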
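Bulk loading uses HBase's own file-loading tool, bulkLoad: the data is first converted to HFiles, so HBase can skip the WAL and the MemStore flush path and load the files directly into storage. Note that the TableOutputFormat snippet above is not a bulk load in this sense: it sends Put objects through the normal write path, WAL and MemStore included. A true bulk load writes HFiles with HFileOutputFormat2 and then hands them to LoadIncrementalHFiles (BulkLoadHFiles in newer HBase 2.x releases).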
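
The HFile-based bulk load described above can be sketched roughly as follows. This is a sketch under stated assumptions rather than a tested recipe: it assumes the HBase 1.x/2.x classes HFileOutputFormat2 and LoadIncrementalHFiles (deprecated in later 2.x releases in favor of BulkLoadHFiles), ASCII row keys so that String ordering matches HBase's byte ordering, a single hypothetical column cf:col, and a staging directory chosen by the caller. The helper names toCell and bulkLoad are made up for the example.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD

// Convert one (rowKey, value) record into the cell format written to HFiles.
def toCell(rowKey: String, value: String): (ImmutableBytesWritable, KeyValue) = {
  val row = Bytes.toBytes(rowKey)
  // Hypothetical column cf:col.
  val kv = new KeyValue(row, Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
  (new ImmutableBytesWritable(row), kv)
}

def bulkLoad(records: RDD[(String, String)], hTabName: String, stagingDir: String): Unit = {
  val conf = HBaseConfiguration.create()
  val tableName = TableName.valueOf(hTabName)
  val conn = ConnectionFactory.createConnection(conf)
  try {
    val table = conn.getTable(tableName)
    val locator = conn.getRegionLocator(tableName)
    val job = Job.getInstance(conf)
    job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setMapOutputValueClass(classOf[KeyValue])
    // Copies table settings (compression, block size, ...) into the job configuration.
    HFileOutputFormat2.configureIncrementalLoad(job, table, locator)

    // HFiles must be written in row-key order, so sort before converting.
    // Sorting plain strings also keeps non-serializable HBase types out of the shuffle.
    records
      .sortByKey()
      .map { case (rowKey, value) => toCell(rowKey, value) }
      .saveAsNewAPIHadoopFile(stagingDir, classOf[ImmutableBytesWritable],
        classOf[KeyValue], classOf[HFileOutputFormat2], job.getConfiguration)

    // Move the generated HFiles into the table's regions, bypassing WAL and MemStore.
    new LoadIncrementalHFiles(conf).doBulkLoad(new Path(stagingDir), conn.getAdmin, table, locator)
  } finally conn.close()
}
```

The trade-off is an extra sort and a staging directory on the cluster file system in exchange for skipping the regular write path, which tends to pay off only for large loads.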