Apache Beam / Google Cloud Dataflow BigQuery reader failing from the second run: answers

【Question title】: Apache Beam / Google Cloud Dataflow BigQuery reader failing from second run
【Posted】: 2021-01-27 20:37:32
【Question description】:

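We built a pipeline with Apache Beam and deployed it on GCP Dataflow infrastructure. The first run of the Dataflow job works perfectly and creates the partitioned table as expected, but on the second run it clears the results from the dataset instead of replacing the data in that specific partition with the new data. When run locally with the Direct runner, the job works perfectly.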

Code sample:

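    import com.google.api.services.bigquery.model.TableRow;
    import java.util.stream.Collectors;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptor;

    // Read the selected columns with a Standard SQL query.
    PCollection<TableRow> tableRow =
        pipeline.apply(
            "Read from BigQuery (table_name) Table: ",
            BigQueryIO.readTableRows()
                .fromQuery(
                    String.format(
                        "SELECT %s FROM `%s.%s.%s`",
                        FIELDS.stream().collect(Collectors.joining(",")), project, dataset, table))
                .usingStandardSql()
                .withoutValidation());
    // Convert each TableRow to the model type.
    PCollection<Model> rows =
        tableRow.apply(
            "TableRows to BigQueryVideoPlacement.Placement",
            MapElements.into(TypeDescriptor.of(Model.class))
                .via(Model::fromTableRow));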

If you know what I am missing here, please let me know. Thanks in advance!

【Question discussion】:

Tags: google-cloud-dataflow apache-beam apache-beam-io

【Solution 1】:

Figured it out!

This is the change I made for the templated environment:

    // Same read as in the question, with withTemplateCompatibility() added.
    PCollection<TableRow> tableRow =
        pipeline.apply(
            "Read from BigQuery (table_name) Table: ",
            BigQueryIO.readTableRows()
                .fromQuery(
                    String.format(
                        "SELECT %s FROM `%s.%s.%s`",
                        FIELDS.stream().collect(Collectors.joining(",")), project, dataset, table))
                .usingStandardSql()
                .withoutValidation()
                .withTemplateCompatibility());
    // Convert each TableRow to the model type.
    PCollection<Model> rows =
        tableRow.apply(
            "TableRows to BigQueryVideoPlacement.Placement",
            MapElements.into(TypeDescriptor.of(Model.class))
                .via(Model::fromTableRow));

.withTemplateCompatibility()

Please see the documentation for more details.
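Both snippets above assemble the query text with `String.format` and `Collectors.joining` before passing it to `fromQuery`. Because `.withoutValidation()` skips the check at pipeline construction, it can help to build and print that string on its own first. The following is only a minimal sketch of that step using the standard library: the class `QuerySketch`, the helper `buildQuery`, and the project, dataset, table, and column names are hypothetical placeholders, not taken from the original post.

```java
import java.util.List;
import java.util.stream.Collectors;

class QuerySketch {
    // Hypothetical helper: builds the Standard SQL text in the same way the
    // snippets above do before handing it to fromQuery(...).
    static String buildQuery(List<String> fields, String project, String dataset, String table) {
        if (fields.isEmpty()) {
            throw new IllegalArgumentException("at least one field is required");
        }
        // Join the column names into a comma-separated select list.
        String columns = fields.stream().collect(Collectors.joining(", "));
        // Backticks quote the fully qualified table name in Standard SQL.
        return String.format("SELECT %s FROM `%s.%s.%s`", columns, project, dataset, table);
    }

    public static void main(String[] args) {
        // Placeholder names, for illustration only.
        String query = buildQuery(List.of("id", "name"), "my-project", "my_dataset", "my_table");
        System.out.println(query);
    }
}
```

Printing the string this way makes it easy to paste into the BigQuery console and confirm the query itself is valid before it goes into the template.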

【Discussion】: