【Question title】: Dataflow batch template provided by Google does not work
【Posted】: 2021-04-11 21:30:26
【Question description】:

I want to run the example in [1].
However, when I do, I get the following error:

org.apache.beam.sdk.Pipeline$PipelineExecutionException: org.apache.avro.UnresolvedUnionException: Not in union ["null",{"type":"int","logicalType":"date"}]: 1990-01-01 (field=birthday)
    at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish (DirectRunner.java:353)
    at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish (DirectRunner.java:321)
    at org.apache.beam.runners.direct.DirectRunner.run (DirectRunner.java:216)
    at org.apache.beam.runners.direct.DirectRunner.run (DirectRunner.java:67)
    at org.apache.beam.sdk.Pipeline.run (Pipeline.java:317)
    at org.apache.beam.sdk.Pipeline.run (Pipeline.java:303)
    at org.apache.beam.examples.Test.run (Test.java:299)
    at org.apache.beam.examples.Test.main (Test.java:232)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:566)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
    at java.lang.Thread.run (Thread.java:834)
Caused by: org.apache.avro.UnresolvedUnionException: Not in union ["null",{"type":"int","logicalType":"date"}]: 1990-01-01 (field=birthday)
    at org.apache.avro.generic.GenericDatumWriter.writeField (GenericDatumWriter.java:223)
    at org.apache.avro.generic.GenericDatumWriter.writeRecord (GenericDatumWriter.java:210)
    at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion (GenericDatumWriter.java:131)
    at org.apache.avro.generic.GenericDatumWriter.write (GenericDatumWriter.java:83)
    at org.apache.avro.generic.GenericDatumWriter.write (GenericDatumWriter.java:73)
    at org.apache.beam.sdk.coders.AvroCoder.encode (AvroCoder.java:317)
    at org.apache.beam.sdk.coders.Coder.encode (Coder.java:136)
    at org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream (CoderUtils.java:82)
    at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray (CoderUtils.java:66)
    at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray (CoderUtils.java:51)
    at org.apache.beam.sdk.util.CoderUtils.clone (CoderUtils.java:141)
    at org.apache.beam.sdk.util.MutationDetectors$CodedValueMutationDetector.<init> (MutationDetectors.java:115)
    at org.apache.beam.sdk.util.MutationDetectors.forValueWithCoder (MutationDetectors.java:46)
    at org.apache.beam.runners.direct.ImmutabilityCheckingBundleFactory$ImmutabilityEnforcingBundle.add (ImmutabilityCheckingBundleFactory.java:112)
    at org.apache.beam.runners.direct.ParDoEvaluator$BundleOutputManager.output (ParDoEvaluator.java:301)
    at org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner.outputWindowedValue (SimpleDoFnRunner.java:267)
    at org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner.access$900 (SimpleDoFnRunner.java:79)
    at org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner$DoFnProcessContext.output (SimpleDoFnRunner.java:413)
    at org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner$DoFnProcessContext.output (SimpleDoFnRunner.java:401)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$TypedRead$3.processElement (BigQueryIO.java:1139)

For reference, the Avro version is 1.10.1.
Is there a way to fix this?

[1]https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/v2/bigquery-to-parquet/src/main/java/com/google/cloud/teleport/v2/templates/BigQueryToParquet.java

【Comments】:

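  • What Avro schema are you using?
  • I'm sorry. I'm new to Avro, so I don't know how to answer your question. Where can I look to find out?
  • Fair enough, I don't know where your schema is either. This is a known issue with Apache Beam. I think redesigning your "birthday" date type as an int or a string would help.
  • I am using BigQuery table data. I understand that this is a known issue. I have verified that it works when I reference a table that does not use the DATE, TIME, or RECORD types. How can I use Apache Beam to read a table that contains DATE, TIME, and RECORD types and write it out in Parquet format? If you know, please tell me.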

Tags: google-cloud-platform google-bigquery google-cloud-dataflow apache-beam avro


【Solution 1】:

This looks like a bug.

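  1. The Avro schema is retrieved from BigQuery, and it says that the field birthday should be a nullable integer representing a date.
  2. The actual value at write time is a string or a date object (the output does not show which).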
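
To make the mismatch concrete, here is a minimal sketch of how a value is matched against the branches of the union ["null", {"type":"int","logicalType":"date"}] when no logical-type conversion is applied. The class NullableDateUnionSketch and the method resolveBranch are hypothetical stand-ins written for illustration; this is not Avro's actual implementation, only the same idea in plain Java.

```java
public class NullableDateUnionSketch {

    // Hypothetical stand-in for the union ["null", int (logicalType date)].
    // Returns the index of the branch that accepts the value,
    // or throws if no branch accepts it.
    static int resolveBranch(Object value) {
        if (value == null) {
            return 0; // "null" branch
        }
        if (value instanceof Integer) {
            return 1; // "int" branch; the date logical type only annotates it
        }
        throw new IllegalArgumentException(
            "Not in union [\"null\", int(date)]: " + value);
    }

    public static void main(String[] args) {
        System.out.println(resolveBranch(null)); // null is accepted
        System.out.println(resolveBranch(7305)); // an Integer is accepted
        try {
            // A String such as the one in the error message fits neither branch.
            resolveBranch("1990-01-01");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this reading, the value 1990-01-01 in the error message is something other than an Integer, which is why neither branch of the union accepts it.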

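I looked through Avro's code, and this exception seems to be thrown only when writing data, not when reading it. So it appears to happen during the write to Parquet, probably because the GenericRecord loaded from the Avro file has had its birthday field converted, so that it is no longer an integer.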

You can work around this by avoiding the DATE type.
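
For reference, Avro's date logical type annotates an int that counts days since 1970-01-01, and one of the comments on the question suggests representing the field as an int or a string instead. The sketch below shows both representations for the value from the error message using only java.time. The class DateWorkaroundSketch and its method names are hypothetical, and whether the template offers a place to apply such a conversion is an assumption this thread does not confirm.

```java
import java.time.LocalDate;

public class DateWorkaroundSketch {

    // Convert an ISO-8601 date string to the int form that a plain
    // Avro "int" field can hold: days since 1970-01-01.
    static int toEpochDays(String isoDate) {
        return Math.toIntExact(LocalDate.parse(isoDate).toEpochDay());
    }

    // Convert days since 1970-01-01 back to an ISO-8601 string,
    // which a plain "string" field can hold.
    static String toIsoString(int epochDays) {
        return LocalDate.ofEpochDay(epochDays).toString();
    }

    public static void main(String[] args) {
        int days = toEpochDays("1990-01-01");
        System.out.println(days);              // prints 7305
        System.out.println(toIsoString(days)); // prints 1990-01-01
    }
}
```

Either form sidesteps the DATE type: the int matches the integer the schema asks for, and the string matches a schema in which the column is declared as a string.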

【Comments】:
