Best way to read minibatches from a very large BigQuery table? Answers

[Question title]: Best way to read minibatches from a very large BigQuery table?
[Posted]: 2020-01-29 01:38:58
[Question description]:

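I have a large (>200 million rows) BigQuery table that I'd like to read minibatches from so I can train a machine learning model. The dataset is too large to fit in memory, so I can't read it all at once, but I want my model to learn from all of the data. I also want to avoid issuing too many queries, because network latency would slow down training. What is the best way to do this in Python?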


Tags: python google-bigquery training-data

[Solution 1]:

Felipe's answer works if you're using TF, but if you're using PyTorch or want something more agnostic to your training platform, faucetml might work well:

https://github.com/econti/faucetml

As per the example in the docs, if you wanted to train for two epochs:

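# Import path as I recall it from the faucetml README; verify it against your installed version.
from faucetml.data_reader import get_client

fml = get_client(
    datastore="bigquery",
    credential_path="bq_creds.json",
    table_name="my_training_table",
    ds="2020-01-20",
    epochs=2,
    batch_size=1024,  # rows per minibatch (the original snippet was missing this comma)
    chunk_size=1024 * 10000,  # a multiple of batch_size, much larger than one batch
    test_split_percent=20,
)

for epoch in range(2):
    fml.prep_for_epoch()
    batch = fml.get_batch()
    # get_batch() returns None once the epoch is exhausted
    while batch is not None:
        train(batch)  # train() is your own training step
        batch = fml.get_batch()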
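
The chunk_size / batch_size split above points at a general pattern: fetch many rows per network request, then slice minibatches out of the rows already in memory. Below is a minimal, library-agnostic sketch of that pattern. It is not faucetml's implementation; `iter_minibatches`, `fetch_chunk`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for whatever actually runs the BigQuery read.

```python
from typing import Callable, Iterator, List, Sequence


def iter_minibatches(
    fetch_chunk: Callable[[int, int], Sequence[dict]],
    chunk_size: int,
    batch_size: int,
) -> Iterator[List[dict]]:
    """Yield minibatches while issuing one fetch per `chunk_size` rows.

    `fetch_chunk(offset, limit)` is an assumed callable that returns up to
    `limit` rows starting at `offset`. Choose `chunk_size` as a multiple of
    `batch_size` so that only the very last batch can be short.
    """
    offset = 0
    while True:
        chunk = fetch_chunk(offset, chunk_size)  # one round trip per chunk
        if not chunk:
            return
        # Slice the in-memory chunk into minibatches without further queries.
        for start in range(0, len(chunk), batch_size):
            yield list(chunk[start:start + batch_size])
        if len(chunk) < chunk_size:
            return  # a short chunk means the table is exhausted
        offset += len(chunk)


# Stand-in data source (hypothetical): a real version would query BigQuery.
FAKE_TABLE = [{"id": i} for i in range(20)]
calls = []


def fake_fetch(offset: int, limit: int) -> Sequence[dict]:
    calls.append(offset)  # record each simulated network request
    return FAKE_TABLE[offset:offset + limit]


batches = list(iter_minibatches(fake_fetch, chunk_size=8, batch_size=4))
print(len(batches), "batches from", len(calls), "fetches")
```

With these toy numbers, three fetches cover all 20 rows. The trade-off is memory: a larger chunk_size means fewer round trips but more rows held in memory at once.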


[Solution 2]:

Are you using TensorFlow?

tfio.bigquery.BigQueryClient 0.9.0 solves this problem:

read_session(
    parent,
    project_id,
    table_id,
    dataset_id,
    selected_fields,
    output_types=None,
    row_restriction='',
    requested_streams=1
)

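with

requested_streams: Initial number of streams. If unset or 0, a number of streams is chosen to produce reasonable throughput. Must be non-negative. The number of streams may be lower than the requested number, depending on how much parallelism is reasonable for the table and the maximum parallelism the system allows.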

https://www.tensorflow.org/io/api_docs/python/tfio/bigquery/BigQueryClient

Source code:

https://github.com/tensorflow/io/tree/master/tensorflow_io/bigquery
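
To make the requested_streams idea concrete, here is a small pure-Python sketch of how rows from several streams can be interleaved and then grouped into minibatches. It only illustrates the concept and is not how tfio reads streams: `interleave` and `batched` are hypothetical helpers, the streams are plain ranges standing in for row streams, and the reading here is sequential rather than parallel.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def interleave(streams: List[Iterable[T]]) -> Iterator[T]:
    """Round-robin over several row streams until every stream is exhausted."""
    iterators = [iter(s) for s in streams]
    while iterators:
        still_open = []
        for it in iterators:
            try:
                yield next(it)
            except StopIteration:
                continue  # this stream is done; drop it from the rotation
            still_open.append(it)
        iterators = still_open


def batched(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group any row iterator into lists of at most `batch_size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Three toy "streams" of unequal length, as a server might hand back.
streams = [range(0, 5), range(100, 103), range(200, 204)]
stream_batches = list(batched(interleave(streams), batch_size=4))
print(stream_batches)
```

Because streams can have different lengths, and the service may return fewer streams than requested, the consumer should not assume a fixed stream count or an even split of rows.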
