【Question title】: API config for BigQuery Federated Data Source
【Posted】: 2018-03-22 16:19:09
【Question】:

I have the following config, which works fine for loading a bunch of files into BigQuery:

config = {
  'configuration' => {
    'load' => {
      'sourceUris' => 'gs://my_bucket/my_files_*',
      'schema' => {
        'fields' => fields_array
      },
      'schemaUpdateOptions' => [{ 'ALLOW_FIELD_ADDITION' => true }],
      'destinationTable' => {
        'projectId' => 'my_project',
        'datasetId' => 'my_dataset',
        'tableId' => 'my_table'
      },
      'sourceFormat' => 'NEWLINE_DELIMITED_JSON',
      'createDisposition' => 'CREATE_IF_NEEDED',
      'writeDisposition' => 'WRITE_TRUNCATE',
      'maxBadRecords' => 0
    }
  }
}

It is then executed as follows, where client has already been initialized:

result = client.execute(
  api_method: big_query.jobs.insert,
  parameters: { projectId: 'my_project', datasetId: 'my_dataset' },
  body_object: config
)

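I am now trying to write the equivalent code to create an external / federated data source instead of loading the data. I need this to create staging tables efficiently for ETL purposes. I have done it successfully through the BigQuery UI, but it needs to run in code because it will eventually be a daily automated process. I am having some trouble with the API docs and cannot find any good examples to refer to. Can anyone help? Thanks in advance!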

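【Comments】:

Tags: ruby-on-rails ruby google-cloud-platform google-bigquery google-cloud-storage

【Answer 1】:

By creating an external data source, do you mean creating a table that refers to an external data source? In that case, you can use bigquery.tables.insert and fill in externalDataConfiguration. The table can then be used in queries to read from the external data source.

If you only want to use the external data source in a single query, you can attach a temporary external table to that query by putting the table definition into tableDefinitions. On the command line it looks like this: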

bq query --external_table_definition=avroTable::AVRO=gs://path-to-avro 'SELECT * FROM avroTable'
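
For anyone working with the REST API rather than the bq tool, the same idea might look roughly like the sketch below. It assumes the tableDefinitions map sits under configuration.query in a jobs.insert body; the helper name query_job_with_external_table is hypothetical, and the bucket path is the placeholder from the command above.

```ruby
require 'json'

# Hypothetical helper: builds a jobs.insert body whose query reads from a
# temporary external table defined inline through tableDefinitions.
def query_job_with_external_table(table_name:, source_uris:, source_format:, sql:)
  {
    'configuration' => {
      'query' => {
        'query' => sql,
        # Map of table name => external data configuration, scoped to this query.
        'tableDefinitions' => {
          table_name => {
            'sourceFormat' => source_format,
            # sourceUris is a list, so a bare string is wrapped in an array.
            'sourceUris' => Array(source_uris)
          }
        }
      }
    }
  }
end

body = query_job_with_external_table(
  table_name: 'avroTable',
  source_uris: 'gs://path-to-avro',
  source_format: 'AVRO',
  sql: 'SELECT * FROM avroTable'
)
puts JSON.pretty_generate(body)
```

The resulting hash would be passed as the request body in the same way as the load config in the question; nothing is created permanently, since the table definition only lives for the duration of the query.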

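【Comments】:

Yes, that's right, Zhang Hua. Your pointer to bigquery.tables.insert put me on the right track; I was still trying to run bigquery.jobs.insert. The config took some trial and error to get right, but I have it working now!
Glad to know you got it working. Mind marking it as the answer? :)

【Answer 2】:

Use the idiomatic Cloud libraries whenever possible

Use the BigQuery module from GCP's idiomatic Ruby client, which is generally available, rather than google-api-ruby-client, which is both in "maintenance mode only" and "alpha". The recommendation itself comes from Google's documentation for these libraries.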

Authentication:

You can define the project and the access credentials using environment variables.
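
As a rough illustration, a script could check up front which of those variables are set before it creates the client. This is only a sketch: the variable names below are assumed from the google-cloud-bigquery authentication guide, and the helpers first_set, bigquery_project and bigquery_credentials are hypothetical, not part of the library (the library performs its own lookup).

```ruby
# Hypothetical helper: returns the value of the first variable in `names`
# that is set to a non-empty string in `env`, or nil if none is set.
def first_set(env, names)
  names.map { |name| env[name] }.find { |value| value && !value.empty? }
end

# Project ID, assuming the library-specific variable takes precedence.
def bigquery_project(env = ENV)
  first_set(env, %w[BIGQUERY_PROJECT GOOGLE_CLOUD_PROJECT])
end

# Credentials (a path or JSON), using the same assumed precedence.
def bigquery_credentials(env = ENV)
  first_set(env, %w[BIGQUERY_CREDENTIALS GOOGLE_CLOUD_CREDENTIALS
                    BIGQUERY_KEYFILE GOOGLE_CLOUD_KEYFILE])
end

puts "project: #{bigquery_project || '(not set)'}"
puts "credentials: #{bigquery_credentials ? 'set' : '(not set)'}"
```

A check like this makes a daily job fail early with a readable message instead of failing later inside the first API call.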

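How to create an external data source object

Here is an example that uses bigquery.external to create an external data source. I modified it slightly to add the relevant configuration from your solution.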

require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

json_url = "gs://my_bucket/my_files_*"
# The URL has no file extension, so the format is passed explicitly.
json_table = bigquery.external json_url, format: :json do |json|
  json.autodetect = true
  json.max_bad_records = 10
end

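The external data source object exposes configuration methods such as autodetect, max_bad_records and urls; see the library's reference documentation for the full list.

How to query it: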

data = bigquery.query "SELECT * FROM my_ext_table",
                      external: { my_ext_table: json_table }

data.each do |row|
  puts row[:name]
end

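Note: writeDisposition and createDisposition apply only to load/copy/query jobs that modify permanent BigQuery tables, so they do not make much sense for an external data source. In fact, they appear neither in the REST API reference nor in the "Try this API" section for externalDataConfiguration.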
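
To make the note concrete, here is a minimal sketch of deriving an external-table body from a load config shaped like the one in the question. The helper name external_table_body and the list of load-only keys are assumptions for illustration; check the REST reference before relying on the exact set.

```ruby
require 'json'

# Hypothetical helper: derives a tables.insert body for an external table
# from a jobs.insert load configuration.
def external_table_body(load_config)
  load_section = load_config.fetch('configuration').fetch('load')
  # Settings assumed to apply only to load jobs writing a permanent table.
  load_only = %w[destinationTable writeDisposition createDisposition schemaUpdateOptions]
  external = load_section.reject { |key, _| load_only.include?(key) }
  # sourceUris is a list in the REST API, so a bare string is wrapped.
  external['sourceUris'] = Array(external['sourceUris'])
  {
    'tableReference' => load_section.fetch('destinationTable'),
    'externalDataConfiguration' => external
  }
end

load_config = {
  'configuration' => {
    'load' => {
      'sourceUris' => 'gs://my_bucket/my_files_*',
      'destinationTable' => {
        'projectId' => 'my_project', 'datasetId' => 'my_dataset', 'tableId' => 'my_table'
      },
      'sourceFormat' => 'NEWLINE_DELIMITED_JSON',
      'createDisposition' => 'CREATE_IF_NEEDED',
      'writeDisposition' => 'WRITE_TRUNCATE',
      'maxBadRecords' => 0
    }
  }
}
puts JSON.pretty_generate(external_table_body(load_config))
```

The destination table moves to tableReference, the source settings move under externalDataConfiguration, and the two disposition settings are dropped because there is no permanent table to create or overwrite.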

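【Comments】:

【Answer 3】:

For anyone attempting the same thing, this is what I used to get it working. There are not many working examples online and the docs take some deciphering, so hopefully this helps someone else!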

config = {
  'kind' => 'bigquery#table',
  'tableReference' => {
    'projectId' => 'my_project',
    'datasetId' => 'my_dataset',
    'tableId' => 'my_table'
  },
  'externalDataConfiguration' => {
    'autodetect' => true,
    'sourceUris' => ['gs://my_bucket/my_files_*'],
    'sourceFormat' => 'NEWLINE_DELIMITED_JSON',
    'maxBadRecords' => 10
  }
}

The documentation for externalDataConfiguration can be found in the BigQuery REST API reference and in the "Try this API" section for bigquery.tables.insert.

Then, as pointed out in Zhang Hua's answer, you run bigquery.tables.insert instead of bigquery.jobs.insert:

result = client.execute(
  api_method: big_query.tables.insert,
  parameters: { projectId: 'my_project', datasetId: 'my_dataset' },
  body_object: config
)

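【Comments】:

writeDisposition and createDisposition are only used by load/copy/query jobs that modify permanent BigQuery tables, and do not make much sense for external data sources. In fact, they appear neither in the [REST API reference][cloud.google.com/bigquery/docs/reference/rest/v2/… nor in the "Try this API" section for externalDataConfiguration.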