【Posted】: 2022-08-19 01:28:08
【Question】:
I am running a Beam pipeline on Python 3.8 with Beam 2.41.0rc1, using the following arguments:
argv = [
    "--runner", "DataflowRunner",
    "--experiments=use_runner_v2",
    "--sdk_container_image=us.gcr.io/some_beam_image_based_on_2.41.0rc1",
]
The Beam image is built with the Bazel Docker rules:
In WORKSPACE:
# https://hub.docker.com/r/apache/beam_python3.8_sdk/tags
container_pull(
    name = "beam_python",
    # 2.41.0rc1
    digest = "sha256:0036b90ecfefddd1dd1614b9cd1ccec7c5a906ee2185542996bc26d6408d9e14",
    registry = "registry.hub.docker.com",
    repository = "apache/beam_python3.8_sdk",
)
In BUILD:
cc_image(
    name = "sample_image",
    binary = ":sample",
)
container_layer(
    name = "sample_layer",
    tars = [":sample_image"],
)
container_image(
    name = "beam_sample_image",
    base = "@beam_python//image",
    layers = [":sample_layer"],
)
A custom apache-beam build appears to be installed in the image. I am not sure whether it is 2.41.0rc1.
root@3dc8fe29cd99:/# pip freeze
absl-py==1.2.0
apache-beam @ file:///opt/apache/beam/tars/apache-beam.tar.gz
astunparse==1.6.3
atomicwrites==1.4.1
attrs==21.4.0
beautifulsoup4==4.11.1
...
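Because `pip freeze` only shows a `file:///` path for apache-beam, the version has to be read from the installed package metadata instead. A minimal sketch using the standard library, meant to be run inside the container; `installed_version` is a hypothetical helper name, not part of Beam:

```python
from importlib import metadata  # in the standard library since Python 3.8


def installed_version(dist_name):
    """Return the installed version string of dist_name, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Compare the printed value with the version the launcher requests
    # (2.41.0rc1 in the logs of this question).
    print("apache-beam:", installed_version("apache-beam"))
```

Running `python -c "import apache_beam; print(apache_beam.__version__)"` in the container should report the same value.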
I see the following logs:
I0815 18:11:40.158377 140374774146880 stager.py:927] Downloading source distribution of the SDK from PyPi
I0815 18:11:40.158492 140374774146880 stager.py:934] Executing command: ['/home/swang/.cache/bazel/_bazel_swang/09eb83215bfa3a8425e4385b45dbf00d/execroot/__main__/bazel-out/k8-opt/bin/garage/sample_launch.runfiles/python3_8_x86_64-unknown-linux-gnu/bin/python3', '-m', 'pip', 'download', '--dest', '/tmp/tmpuqnjdrj3', 'apache-beam==2.41.0rc1', '--no-deps', '--no-binary', ':all:']
WARNING: You are using pip version 22.0.4; however, version 22.2.2 is available.
You should consider upgrading via the '/home/swang/.cache/bazel/_bazel_swang/09eb83215bfa3a8425e4385b45dbf00d/execroot/__main__/bazel-out/k8-opt/bin/garage/sample_launch.runfiles/python3_8_x86_64-unknown-linux-gnu/bin/python3 -m pip install --upgrade pip' command.
I0815 18:11:42.881979 140374774146880 stager.py:825] Staging SDK sources from PyPI: dataflow_python_sdk.tar
I0815 18:11:42.883261 140374774146880 stager.py:900] Downloading binary distribution of the SDK from PyPi
I0815 18:11:42.883335 140374774146880 stager.py:934] Executing command: ['/home/swang/.cache/bazel/_bazel_swang/09eb83215bfa3a8425e4385b45dbf00d/execroot/__main__/bazel-out/k8-opt/bin/garage/sample_launch.runfiles/python3_8_x86_64-unknown-linux-gnu/bin/python3', '-m', 'pip', 'download', '--dest', '/tmp/tmpuqnjdrj3', 'apache-beam==2.41.0rc1', '--no-deps', '--only-binary', ':all:', '--python-version', '38', '--implementation', 'cp', '--abi', 'cp38', '--platform', 'manylinux1_x86_64']
WARNING: You are using pip version 22.0.4; however, version 22.2.2 is available.
You should consider upgrading via the '/home/swang/.cache/bazel/_bazel_swang/09eb83215bfa3a8425e4385b45dbf00d/execroot/__main__/bazel-out/k8-opt/bin/garage/sample_launch.runfiles/python3_8_x86_64-unknown-linux-gnu/bin/python3 -m pip install --upgrade pip' command.
I0815 18:11:44.672350 140374774146880 stager.py:842] Staging binary distribution of the SDK from PyPI: apache_beam-2.41.0rc1-cp38-cp38-manylinux1_x86_64.whl
I0815 18:11:44.675273 140374774146880 dataflow_runner.py:477] Pipeline has additional dependencies to be installed in SDK worker container, consider using the SDK container image pre-building workflow to avoid repetitive installations. Learn more on https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild
I0815 18:11:44.676919 140374774146880 environments.py:376] Default Python SDK image for environment is apache/beam_python3.8_sdk:2.41.0rc1
I0815 18:11:44.677026 140374774146880 environments.py:295] Using provided Python SDK container image: us.gcr.io/shawn-295406/beam_sample:20220815_test
I0815 18:11:44.677081 140374774146880 environments.py:302] Python SDK container image set to "us.gcr.io/shawn-295406/beam_sample:20220815_test" for Docker environment
I0815 18:11:44.723044 140374774146880 translations.py:714] ==================== <function pack_combiners at 0x7fab84d77820> ====================
I0815 18:11:44.723375 140374774146880 translations.py:714] ==================== <function sort_stages at 0x7fab84d78040> ====================
I0815 18:11:44.730723 140374774146880 apiclient.py:473] Defaulting to the temp_location as staging_location: gs://shizhiw/beam/tmp
I0815 18:11:44.750272 140374774146880 auth.py:136] Setting socket default timeout to 60 seconds.
I0815 18:11:44.750348 140374774146880 auth.py:138] socket default timeout is 60.0 seconds.
I0815 18:11:44.755919 140374774146880 apiclient.py:732] Starting GCS upload to gs://shizhiw/beam/tmp/beamapp-swang-0816011144-730582-ppdswudf.1660612304.730851/dataflow_python_sdk.tar...
I0815 18:11:45.899281 140374774146880 apiclient.py:748] Completed GCS upload to gs://shizhiw/beam/tmp/beamapp-swang-0816011144-730582-ppdswudf.1660612304.730851/dataflow_python_sdk.tar in 1 seconds.
I0815 18:11:45.899615 140374774146880 apiclient.py:732] Starting GCS upload to gs://shizhiw/beam/tmp/beamapp-swang-0816011144-730582-ppdswudf.1660612304.730851/apache_beam-2.41.0rc1-cp38-cp38-manylinux1_x86_64.whl...
I0815 18:11:48.883744 140374774146880 apiclient.py:748] Completed GCS upload to gs://shizhiw/beam/tmp/beamapp-swang-0816011144-730582-ppdswudf.1660612304.730851/apache_beam-2.41.0rc1-cp38-cp38-manylinux1_x86_64.whl in 2 seconds.
I0815 18:11:48.884612 140374774146880 apiclient.py:732] Starting GCS upload to gs://shizhiw/beam/tmp/beamapp-swang-0816011144-730582-ppdswudf.1660612304.730851/pipeline.pb...
I0815 18:11:49.025467 140374774146880 apiclient.py:748] Completed GCS upload to gs://shizhiw/beam/tmp/beamapp-swang-0816011144-730582-ppdswudf.1660612304.730851/pipeline.pb in 0 seconds.
I0815 18:11:49.855348 140374774146880 apiclient.py:911] Create job: <Job
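I am a bit confused by the logs:
- Beam is already installed both locally and in the container image, so why does it appear to be downloaded again?
- I am already using a custom container (the base Beam image plus a C++ binary), so why does the log still suggest that I use the "pre-building workflow..."?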
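To see exactly which files the launcher staged for the workers, the uploaded file names can be pulled out of the log. A minimal sketch, assuming the log keeps the `Starting GCS upload to gs://...` wording shown in the logs of this question; `staged_artifacts` is a hypothetical helper name, not part of Beam:

```python
import re

# Matches the "Starting GCS upload" lines; the target path is followed by "...".
UPLOAD_RE = re.compile(r"Starting GCS upload to (gs://\S+?)\.\.\.$")


def staged_artifacts(log_lines):
    """Return the base names of the files the log reports as being uploaded to GCS."""
    names = []
    for line in log_lines:
        match = UPLOAD_RE.search(line.strip())
        if match:
            # Keep only the last path component of the gs:// URL.
            names.append(match.group(1).rsplit("/", 1)[-1])
    return names


if __name__ == "__main__":
    # Three lines copied from the log in the question.
    sample = [
        "I0815 18:11:44.755919 140374774146880 apiclient.py:732] Starting GCS upload to gs://shizhiw/beam/tmp/beamapp-swang-0816011144-730582-ppdswudf.1660612304.730851/dataflow_python_sdk.tar...",
        "I0815 18:11:45.899281 140374774146880 apiclient.py:748] Completed GCS upload to gs://shizhiw/beam/tmp/beamapp-swang-0816011144-730582-ppdswudf.1660612304.730851/dataflow_python_sdk.tar in 1 seconds.",
        "I0815 18:11:45.899615 140374774146880 apiclient.py:732] Starting GCS upload to gs://shizhiw/beam/tmp/beamapp-swang-0816011144-730582-ppdswudf.1660612304.730851/apache_beam-2.41.0rc1-cp38-cp38-manylinux1_x86_64.whl...",
    ]
    print(staged_artifacts(sample))
```

For the full log in the question, this lists the SDK source tarball, the SDK wheel, and `pipeline.pb`, which shows that the two SDK packages were staged alongside the pipeline definition even though a custom container image was provided.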
- Can you share your Dockerfile?
- Updated. Thanks!
Tags: google-cloud-dataflow apache-beam