How to efficiently read rows from Google BigTable into a pandas DataFrame

【Question title】: How to efficiently read rows from Google BigTable into a pandas DataFrame
【Posted】: 2018-07-27 11:13:07
【Question】:

Use case:

I am using Google BigTable to store counts like this:

|  rowkey  |    columnfamily    |
|          | col1 | col2 | col3 |
|----------|------|------|------|
| row1     | 1    | 2    | 3    |
| row2     | 2    | 4    | 8    |
| row3     | 3    | 3    | 3    |

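I want to read all rows within a given row key range (assume all rows in this case) and aggregate the values of each column.

A naive implementation queries the rows and iterates over them while aggregating the counts, like this: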

from google.cloud.bigtable import Client

instance = Client(project='project').instance('my-instance')
table = instance.table('mytable')

col1_sum = 0
col2_sum = 0
col3_max = 0

# read_rows() returns a streaming result object; keep a reference to it
row_data = table.read_rows()
row_data.consume_all()

# row_data.rows maps row key -> row; cells live under row.cells[family][qualifier]
for row in row_data.rows.values():
    cells = row.cells['columnfamily']
    # [0] is the most recent cell; value is a bytes attribute, not a method
    col1_sum += int.from_bytes(cells[b'col1'][0].value, byteorder='big')
    col2_sum += int.from_bytes(cells[b'col2'][0].value, byteorder='big')
    col3_value = int.from_bytes(cells[b'col3'][0].value, byteorder='big')
    col3_max = max(col3_max, col3_value)

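Question:

Is there a way to efficiently load the resulting rows into a pandas DataFrame and use pandas' performance for the aggregation?

I would like to avoid computing the aggregates in a for loop, since that is known to be very inefficient.

I am aware of the Apache Arrow project and its Python bindings. Although HBase is mentioned as a supported project (and Google BigTable is advertised as being very similar to HBase), I cannot find a way to use it for the use case described here.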


Tags: python pandas bigtable pyarrow

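【Answer 1】:

After digging deeper into how BigTable works, it appears that the Python client performs a gRPC ReadRows call when you call table.read_rows(). That gRPC call returns a streamed response of rows in key order over HTTP/2 (see the docs).

If the API returns data row by row, it seems to me that the only useful way to consume that response is row-based. Trying to load the data in a columnar format just to avoid iterating over the rows seems to be of little use.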
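
To illustrate the row-based consumption described above, here is a minimal sketch that makes a single pass over the rows, collects the decoded counters, and only then hands the aggregation to pandas. The helper names `rows_to_frame` and `fake_row` are hypothetical and not part of any library, and the demo rows are stand-ins that only mimic the `row_key` / `cells[family][qualifier][0].value` shape of the Bigtable client's row objects. It assumes every cell holds a big-endian integer, as in the question.

```python
from types import SimpleNamespace

import pandas as pd


def rows_to_frame(rows, family="columnfamily", columns=("col1", "col2", "col3")):
    """Build a DataFrame from objects shaped like Bigtable row objects.

    Each row is assumed to expose `row_key` (bytes) and
    `cells[family][qualifier_bytes]` as a list of cells with a bytes `.value`.
    """
    records = []
    for row in rows:  # one pass over the row stream is still required
        family_cells = row.cells[family]
        record = {"rowkey": row.row_key.decode("utf-8")}
        for name in columns:
            raw = family_cells[name.encode("utf-8")][0].value
            record[name] = int.from_bytes(raw, byteorder="big")
        records.append(record)
    return pd.DataFrame.from_records(records, index="rowkey")


def fake_row(key, **counts):
    """Hypothetical stand-in for a Bigtable row, used only for this demo."""
    cells = {
        name.encode("utf-8"): [SimpleNamespace(value=value.to_bytes(8, "big"))]
        for name, value in counts.items()
    }
    return SimpleNamespace(row_key=key.encode("utf-8"), cells={"columnfamily": cells})


# Same values as the example table in the question
demo_rows = [
    fake_row("row1", col1=1, col2=2, col3=3),
    fake_row("row2", col1=2, col2=4, col3=8),
    fake_row("row3", col1=3, col2=3, col3=3),
]
df = rows_to_frame(demo_rows)

# Per-column aggregation is delegated to pandas
totals = df.agg({"col1": "sum", "col2": "sum", "col3": "max"})
```

With a real table, the rows passed in would presumably be `row_data.rows.values()` after `row_data.consume_all()`. The loop over rows does not go away; only the aggregation step moves into pandas, which is consistent with the point that the API delivers data row by row.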


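【Answer 2】:

You may be able to use pdhbase together with google-cloud-happybase. If that does not work out of the box, you may be able to draw inspiration from how that integration is done.

There is also a Cloud Bigtable / BigQuery integration, which you could combine with https://github.com/pydata/pandas-gbq (thanks to Wes McKinney for the tip).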
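
As a rough sketch of the BigQuery route, the aggregation could be pushed into SQL so that only a single result row reaches pandas. This assumes a BigQuery table or view over the Bigtable data already exists and exposes flat integer columns named col1, col2 and col3; that schema, the table name `my_dataset.bigtable_counts`, and the helper names `build_aggregate_query` and `read_aggregates` are all hypothetical. The only library call used is `pandas_gbq.read_gbq`.

```python
def build_aggregate_query(table_id):
    """Return SQL that aggregates the three counters inside BigQuery.

    Assumes `table_id` names a table or view with flat integer columns
    col1, col2 and col3 (a hypothetical schema for this sketch).
    """
    return (
        "SELECT SUM(col1) AS col1_sum, SUM(col2) AS col2_sum, MAX(col3) AS col3_max "
        f"FROM `{table_id}`"
    )


def read_aggregates(project_id, table_id):
    """Run the aggregate query through pandas-gbq and return a DataFrame."""
    import pandas_gbq  # imported lazily so the query builder works without it

    return pandas_gbq.read_gbq(build_aggregate_query(table_id), project_id=project_id)


# Hypothetical dataset and table name, for illustration only
query = build_aggregate_query("my_dataset.bigtable_counts")
```

Calling `read_aggregates("my-project", "my_dataset.bigtable_counts")` would need valid Google Cloud credentials and the view described above, so it is not executed here.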


【Answer 3】:

I don't think there is an existing pandas interface for Cloud Bigtable, but it would be a nice project to build, similar to the BigQuery interface in https://github.com/pydata/pandas-gbq.
