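【Posted】: 2016-05-31 21:05:39
【Question】:
Some of our Dataflow jobs crash at random while reading the source data files.
The following error is written to the job log (the worker logs contain nothing):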
11 févr. 2016 à 08:30:54
(33b59f945cff28ab): Workflow failed.
Causes: (fecf7537c059fece): S02:read-edn-file2/TextIO.Read+read-edn-file2
/ParDo(ff19274a)+ParDo(ff19274a)5+ParDo(ff19274a)6+RemoveDuplicates
/CreateIndex+RemoveDuplicates/Combine.PerKey
/GroupByKey+RemoveDuplicates/Combine.PerKey/Combine.GroupedValues
/Partial+RemoveDuplicates/Combine.PerKey/GroupByKey/Reify+RemoveDuplicates
/Combine.PerKey/GroupByKey/Write faile
We sometimes also get this kind of error (recorded in the worker logs):
2016-02-15T10:27:41.024Z: Basic: S18: (43c8777b75bc373e): Executing operation group-by2/GroupByKey/Read+group-by2/GroupByKey/GroupByWindow+ParDo(ff19274a)19+ParDo(ff19274a)20+ParDo(ff19274a)21+write-edn-file3/ParDo(ff19274a)+write-bq-table-from-clj3/ParDo(ff19274a)+write-bq-table-from-clj3/BigQueryIO.Write+write-edn-file3/TextIO.Write
2016-02-15T10:28:03.994Z: Error: (af73c53187b7243a): java.io.IOException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
{
"code" : 503,
"errors" : [ {
"domain" : "global",
"message" : "Backend Error",
"reason" : "backendError"
} ],
"message" : "Backend Error"
}
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:431)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:289)
at com.google.cloud.dataflow.sdk.runners.worker.TextSink$TextFileWriter.close(TextSink.java:243)
at com.google.cloud.dataflow.sdk.util.common.worker.WriteOperation.finish(WriteOperation.java:100)
at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:254)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:191)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:144)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:180)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:161)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:148)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
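The source data files are stored in Google Cloud Storage.
The data paths are correct, and the job usually works after being restarted. We had not encountered this problem until the end of January.
The jobs are launched with the following parameters: --tempLocation='gstoragelocation' --stagingLocation='another gstorage location' --runner=BlockingDataflowPipelineRunner --numWorkers='a few dozen' --zone=europe-west1-d
SDK version: 1.3.0
Thanks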
【Comments】:
-
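Sorry for the trouble. We are currently investigating this and similar issues together with the Google Cloud Storage team. Could you provide the job ID of one of the failed jobs as an example?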
-
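The problem encountered by the first job on 2/10 should be resolved this week. If you see it again, please let us know. Does the second type of error cause the job to fail, or is it transient enough that the bundle succeeds when retried?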
-
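Thanks for your answer. One of our jobs failed again this morning: 2016-02-21_23_00_17-5627071082821060268. The error causes the job to fail even with retries, but the job usually succeeds when it is restarted manually (this applies to both the first and the second type of error).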
-
Another example of a job that just failed: 2016-02-22_02_13_22-5788209240587963563. The worker logs for this one are almost empty.
-
Thanks, Pierre. We are continuing to investigate.
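Since the thread reports that a manual restart usually succeeds, one possible stopgap while the underlying issue is investigated is to wrap the job launch in a retry loop with exponential backoff. The following is only a minimal sketch of that idea, not something taken from the Dataflow SDK: `RetryLauncher` and `retryWithBackoff` are hypothetical names, and the task in `main` merely simulates a launch that fails twice. It assumes the blocking launch call surfaces a failed job as an exception.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of the Dataflow SDK.
class RetryLauncher {

    /**
     * Runs the task and retries it when it throws, up to maxAttempts attempts
     * in total. The wait between attempts starts at initialDelayMillis and
     * doubles after every failure. The last exception is rethrown if every
     * attempt fails.
     */
    static <T> T retryWithBackoff(Callable<T> task, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay); // back off before the next attempt
                    delay *= 2;
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        // Simulated launch: assumed to throw when the job fails.
        // In a real program this would wrap the blocking pipeline run.
        String result = retryWithBackoff(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new IOException("simulated transient storage error");
            }
            return "job finished";
        }, 5, 10L);
        System.out.println(result + " after " + calls.get() + " attempts");
    }
}
```

Each retry submits a completely new job, so this is only appropriate when the pipeline's outputs can safely be written again from scratch.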