【Posted】: 2018-09-05 10:52:31
【Question】:
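I'm trying to do the following with the DataflowRunner:
- Read data from a partitioned BigQuery table (a lot of data, but only the last two days are fetched)
- Read JSON from a Pub/Sub subscription
- Join both collections on a common key
- Insert the joined collection into another BigQuery table
I'm very new to Apache Beam, so I'm not 100% sure that what I want to do is possible.
My problem appears when I try to join the two collections: after applying the CoGroupByKey transform, the data never seems to arrive at the same time, even though the windowing strategy is the same (30-second fixed windows, triggering at the end of the window, and discarding fired panes).
Some relevant blocks of my code: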
/* Getting the data and windowing */
PCollection<PubsubMessage> pubsub = p.apply("ReadPubSub sub",
    PubsubIO.readMessages().fromSubscription(SUB_ALIM_REC));
String query = /* The query */
PCollection<TableRow> bqData = p.apply("Reading BQ",
        BigQueryIO.readTableRows().fromQuery(query).usingStandardSql())
    .apply(Window.<TableRow>into(FixedWindows.of(Duration.standardSeconds(30)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .withAllowedLateness(Duration.standardSeconds(0)).accumulatingFiredPanes());
PCollection<TableRow> tableRow = pubsub
    .apply(Window.<PubsubMessage>into(FixedWindows.of(Duration.standardSeconds(120)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .withAllowedLateness(Duration.standardSeconds(0)).accumulatingFiredPanes())
    .apply("JSON to TableRow", ParDo.of(new ToTableRow()));
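Since CoGroupByKey groups elements per key and per window, two elements can only end up in the same CoGbkResult if they are assigned to the same window. The following is a minimal plain-Java sketch (no Beam dependency) of how epoch-aligned fixed windows map a timestamp to a window start. The class name `FixedWindowSketch` and its helper methods are hypothetical, and the sketch assumes every element already carries an event timestamp.

```java
import java.time.Duration;
import java.time.Instant;

class FixedWindowSketch {
    // Start of the epoch-aligned fixed window of length `size` that contains `ts`.
    static Instant windowStart(Instant ts, Duration size) {
        long millis = ts.toEpochMilli();
        long sizeMillis = size.toMillis();
        return Instant.ofEpochMilli(millis - Math.floorMod(millis, sizeMillis));
    }

    // Two elements share a window only if the window sizes match
    // and both timestamps fall into the same interval.
    static boolean sameWindow(Instant a, Duration sizeA, Instant b, Duration sizeB) {
        return sizeA.equals(sizeB)
            && windowStart(a, sizeA).equals(windowStart(b, sizeB));
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2018-09-05T10:52:31Z");
        System.out.println(windowStart(t, Duration.ofSeconds(30)));
        System.out.println(windowStart(t, Duration.ofSeconds(120)));
        System.out.println(sameWindow(t, Duration.ofSeconds(30), t, Duration.ofSeconds(120)));
    }
}
```

For the sample timestamp 2018-09-05T10:52:31Z, the 30-second window starts at 10:52:30 and the 120-second window starts at 10:52:00, so the two sizes used in the snippet above do not describe the same window.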
/* Join code */
PCollection<TableRow> finalResultCollection =
    kvpCollection.apply("Join TableRows", ParDo.of(
        new DoFn<KV<Long, CoGbkResult>, TableRow>() {
            private static final long serialVersionUID = 6627878974147676533L;

            @ProcessElement
            public void processElement(ProcessContext c) {
                KV<Long, CoGbkResult> e = c.element();
                Long idPaquete = e.getKey();
                // Rows from each side of the join that share this key
                Iterable<TableRow> itPaq = e.getValue().getAll(packTag);
                Iterable<TableRow> itAlimRec = e.getValue().getAll(alimTag);
                // Emit one joined row per pair of matching rows
                for (TableRow t1 : itPaq) {
                    for (TableRow t2 : itAlimRec) {
                        TableRow joinedRow = new TableRow();
                        /* set the required fields from each collection */
                        c.output(joinedRow);
                    }
                }
            }
        }));
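The nested loops above amount to an inner join per key: a row is emitted only when both iterables are non-empty in the same pane. Here is a minimal plain-Java sketch of that behaviour, using `Map<String, Object>` as a stand-in for `TableRow`. The class `JoinSketch`, its methods, and the sample field names are hypothetical and not part of Beam.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class JoinSketch {
    // Cross product of the two sides for a single key (inner-join semantics).
    static List<Map<String, Object>> joinRows(
            List<Map<String, Object>> packRows,
            List<Map<String, Object>> alimRows) {
        List<Map<String, Object>> joined = new ArrayList<>();
        for (Map<String, Object> pack : packRows) {
            for (Map<String, Object> alim : alimRows) {
                // Copy the fields of both sides into one output row.
                Map<String, Object> row = new LinkedHashMap<>(pack);
                row.putAll(alim);
                joined.add(row);
            }
        }
        return joined;
    }

    // Small helper to build a single-field row for the demo.
    static Map<String, Object> row(String field, Object value) {
        Map<String, Object> r = new LinkedHashMap<>();
        r.put(field, value);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> packs = Arrays.asList(row("pack", "A"), row("pack", "B"));
        List<Map<String, Object>> alims = Collections.singletonList(row("alim", "X"));
        System.out.println(joinRows(packs, alims).size());
        // If one side is empty for this key and window, nothing is emitted.
        System.out.println(joinRows(packs, Collections.emptyList()).size());
    }
}
```

If the rows of one collection never land in the same window as the rows of the other, one of the two lists is always empty and the join produces no output, which matches the symptom described in the question.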
For the last two days I have also been getting this error:
java.io.IOException: Failed to start reading from source: org.apache.beam.runners.core.construction.UnboundedReadFromBoundedSource$BoundedToUnboundedSourceAdapter@2808d228
com.google.cloud.dataflow.worker.WorkerCustomSources$UnboundedReaderIterator.start(WorkerCustomSources.java:783)
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:360)
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:193)
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:158)
com.google.cloud.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:75)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1227)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:135)
com.google.cloud.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:966)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.UnsupportedOperationException: BigQuery source must be split before being read
org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.createReader(BigQuerySourceBase.java:153)
org.apache.beam.runners.core.construction.UnboundedReadFromBoundedSource$BoundedToUnboundedSourceAdapter$ResidualSource.advance(UnboundedReadFromBoundedSource.java:463)
org.apache.beam.runners.core.construction.UnboundedReadFromBoundedSource$BoundedToUnboundedSourceAdapter$ResidualSource.access$300(UnboundedReadFromBoundedSource.java:442)
org.apache.beam.runners.core.construction.UnboundedReadFromBoundedSource$BoundedToUnboundedSourceAdapter$Reader.advance(UnboundedReadFromBoundedSource.java:293)
org.apache.beam.runners.core.construction.UnboundedReadFromBoundedSource$BoundedToUnboundedSourceAdapter$Reader.start(UnboundedReadFromBoundedSource.java:286)
com.google.cloud.dataflow.worker.WorkerCustomSources$UnboundedReaderIterator.start(WorkerCustomSources.java:778)
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:360)
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:193)
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:158)
com.google.cloud.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:75)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1227)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:135)
com.google.cloud.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:966)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Any guidance would be much appreciated, either on whether what I'm trying to do is possible or on alternative ways to approach this scenario.
【Comments】:
- I'm not sure your BigQuery results need a timed window at all, do they? Maybe you should take a look at side inputs: beam.apache.org/documentation/programming-guide/#side-inputs
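To illustrate the side-input idea from the comment above without a Beam dependency, here is a plain-Java sketch in which the BigQuery result is modelled as an in-memory map and every Pub/Sub message looks up its match by key. In a Beam pipeline this would roughly correspond to building a map view with `View.asMap()` and reading it inside a `ParDo` configured with `withSideInputs`. The class `SideInputLookupSketch`, the field name `id`, and the sample values are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SideInputLookupSketch {
    // Joins each streaming message with the lookup row that has the same "id".
    // Messages without a match are dropped (inner-join semantics).
    static List<Map<String, Object>> enrich(
            List<Map<String, Object>> messages,
            Map<Long, Map<String, Object>> lookup) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> message : messages) {
            Map<String, Object> match = lookup.get(message.get("id"));
            if (match == null) {
                continue;
            }
            Map<String, Object> row = new LinkedHashMap<>(match);
            row.putAll(message);
            out.add(row);
        }
        return out;
    }

    // Builds a row from alternating field names and values.
    static Map<String, Object> row(Object... fieldsAndValues) {
        Map<String, Object> r = new LinkedHashMap<>();
        for (int i = 0; i + 1 < fieldsAndValues.length; i += 2) {
            r.put((String) fieldsAndValues[i], fieldsAndValues[i + 1]);
        }
        return r;
    }

    public static void main(String[] args) {
        Map<Long, Map<String, Object>> lookup =
            Collections.singletonMap(1L, row("id", 1L, "weight", 2.5));
        List<Map<String, Object>> messages =
            Arrays.asList(row("id", 1L, "status", "OK"), row("id", 2L, "status", "LATE"));
        System.out.println(enrich(messages, lookup));
    }
}
```

A lookup of this kind does not depend on both collections firing in the same window, but it generally assumes that the lookup data (here, the two-day slice) is small enough to be held in memory by the workers.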
Tags: java google-cloud-dataflow apache-beam