【发布时间】:2016-12-19 03:53:25
【问题描述】:
我从 8 月发现了一个类似的 SO 问题,这或多或少是我的团队最近在使用我们的数据流管道时遇到的问题。 How to recover from Cloud Dataflow job failed on com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
这是异常(在大约 1 小时内抛出了几个 410 异常,但我只粘贴最后一个)
(9f012f4bc185d790): java.io.IOException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
{
"code" : 500,
"errors" : [ {
"domain" : "global",
"message" : "Backend Error",
"reason" : "backendError"
} ],
"message" : "Backend Error"
}
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:431)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:289)
at com.google.cloud.dataflow.sdk.runners.worker.TextSink$TextFileWriter.close(TextSink.java:243)
at com.google.cloud.dataflow.sdk.util.common.worker.WriteOperation.finish(WriteOperation.java:97)
at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:80)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:287)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:223)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:173)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:193)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:173)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:160)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Suppressed: java.nio.channels.ClosedChannelException
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.throwIfNotOpen(AbstractGoogleAsyncWriteChannel.java:408)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:286)
at com.google.cloud.dataflow.sdk.runners.worker.TextSink$TextFileWriter.close(TextSink.java:243)
at com.google.cloud.dataflow.sdk.util.common.worker.WriteOperation.abort(WriteOperation.java:112)
at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:86)
... 10 more
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
{
"code" : 500,
"errors" : [ {
"domain" : "global",
"message" : "Backend Error",
"reason" : "backendError"
} ],
"message" : "Backend Error"
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:357)
... 4 more
2016-12-18 (18:58:58) Workflow failed. Causes: (8e88b50b0d86156a): S26:up:2016-12-18:userprofile-lifetime-20161218/Write/D...
(d3b59c20088d726e): Workflow failed. Causes: (8e88b50b0d86156a): S26:up:2016-12-18:userprofile-lifetime-20161218/Write/DataflowPipelineRunner.BatchWrite/Finalize failed., (2f2e396b6ba3aaa2): A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on: userprofile-lifetime-diff-12181745-c04d-harness-0xq5, userprofile-lifetime-diff-12181745-c04d-harness-adga, userprofile-lifetime-diff-12181745-c04d-harness-cz30, userprofile-lifetime-diff-12181745-c04d-harness-lt9m
这是工作编号:2016-12-18_17_45_23-2873252511714037422
根据我之前提到的另一个 SO 问题的答案,我正在使用指定的分片数量重新运行相同的作业(为 4000,因为该作业每天运行并且通常输出 ~4k 文件)。 将分片数量限制在 10k 以下有帮助吗?如果需要,了解这一点对我们重新设计管道很有用。
此外,当指定了分片数量时,作业花费的时间比没有指定时要长得多(主要是因为写入 GCS 的步骤)——就 $'s 而言,这项作业通常花费 75 美元-80(我们每天运行这项工作),而当我指定分片数量时,它花费了 130 美元到 140 美元(增加了 74%)(其他步骤似乎运行了相同的持续时间,或多或少——作业 ID 为 2016-12-18_19_30_32-7274262445792076535)。因此,如果可能的话,我们真的希望避免指定分片的数量。
任何帮助和建议将不胜感激!
-- 跟进 当我在输出目录中尝试“gsutil ls”时,该作业的输出似乎正在消失,然后出现在 GCS 中,即使是在作业完成 10 多个小时后。这可能是一个相关问题,但我在这里创建了一个单独的问题 ("gsutil ls" shows a different list every time)。
【问题讨论】:
标签: google-cloud-storage google-cloud-dataflow