【Posted】:2020-12-05 00:51:10
【Question】:
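While analyzing the YARN launch_container.sh logs for a Spark job, I got confused by some parts of the log. I will go through these questions step by step here.
When you submit a Spark job with --py-files and --files to YARN in cluster mode using spark-submit: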
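1. The config files passed in --files and the executable Python files passed in --py-files are uploaded to the .sparkStaging directory created under the user's Hadoop home directory. Along with these files, pyspark.zip and py4j-version_number.zip from $SPARK_HOME/python/lib are also copied into that .sparkStaging directory.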
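2. After this, launch_container.sh is triggered by YARN, and it exports all the required environment variables. If we have explicitly exported anything, such as PYSPARK_PYTHON, in .bash_profile, in the shell script that builds the spark-submit command, or in spark-env.sh, the default value is replaced with the value we provide.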
   - This PYSPARK_PYTHON is a path on my edge node. How, then, can a container launched on another node use this Python version? The default Python version on the data nodes of my cluster is 2.7.5, so without setting PYSPARK_PYTHON the containers use 2.7.5. But when I set PYSPARK_PYTHON to 3.5.x, they use the version I specified.
   - It defines PWD='/data/complete-path'. Where does this PWD directory reside? The directory is cleaned up after the job completes. I even ran the job in one PuTTY session and kept the /data folder open in another PuTTY session to see whether any directories were created at run time, but I couldn't find any.
   - It also sets PYTHONPATH to $PWD/pyspark.zip:$PWD/py4j-version.zip. Whenever I perform a Python-specific operation in Spark code, it uses PYSPARK_PYTHON. So what is this PYTHONPATH used for?
3. After this, YARN uses ln -sf to create soft links for all the files from step 1: pyspark.zip, py4j-<version>.zip, and all the Python files mentioned in step 1. These links in turn point into the '/data/different_directories' directory (I am not sure where that is located). I know soft links can be used for accessing remote nodes, but why are soft links created here?
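To make steps 2 and 3 concrete, here is a minimal local sketch (not YARN itself) of the pattern the script appears to follow: a zip file sits in one directory, a soft link to it is created in a separate working directory, and the link's path is placed on the Python module search path. The directory names `filecache` and `container_01`, the archive `demo_lib.zip`, and the module `demo_lib_module` are all hypothetical stand-ins, and the sketch assumes a platform where `os.symlink` is permitted. What it illustrates is that PYSPARK_PYTHON only chooses which interpreter runs, whereas PYTHONPATH tells that interpreter where to find importable modules, such as those packed in pyspark.zip.

```python
import importlib
import os
import sys
import zipfile


def simulate_container(base_dir):
    """Mimic, in miniature, a cached zip being linked into a working directory."""
    # Hypothetical stand-in for a node-local cache of downloaded resources.
    cache_dir = os.path.join(base_dir, "filecache")
    # Hypothetical stand-in for the container working directory ($PWD).
    container_dir = os.path.join(base_dir, "container_01")
    os.makedirs(cache_dir)
    os.makedirs(container_dir)

    # Build a tiny zip holding one module, standing in for pyspark.zip.
    zip_path = os.path.join(cache_dir, "demo_lib.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("demo_lib_module.py", "ANSWER = 42\n")

    # Rough equivalent of: ln -sf <cache>/demo_lib.zip <container>/demo_lib.zip
    link_path = os.path.join(container_dir, "demo_lib.zip")
    os.symlink(zip_path, link_path)

    # Rough equivalent of PYTHONPATH=$PWD/demo_lib.zip: Python can import
    # modules directly from a zip archive listed on its search path.
    sys.path.insert(0, link_path)
    importlib.invalidate_caches()
    module = importlib.import_module("demo_lib_module")

    return link_path, os.path.realpath(link_path), module.ANSWER
```

In this sketch the link lives in the working directory while the real file lives elsewhere on the same machine, so a file can be shared between working directories without being copied into each one. Whether YARN's links serve exactly that purpose is the assumption being illustrated, not something the log itself confirms.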
Last but not least, does this launch_container.sh run every time a container is launched?
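One way to investigate these questions from inside the job, rather than from a second terminal session, is a small probe that reports which interpreter is running, which directory it is looking at, what PYTHONPATH contains, and where any soft links in that directory really point. The function name `describe_runtime` and the keys of the returned dictionary are hypothetical and used only for illustration.

```python
import os
import sys


def describe_runtime(directory=None):
    """Report the interpreter, a directory, PYTHONPATH, and symlink targets."""
    # Default to the current working directory of whichever process calls this.
    directory = directory or os.getcwd()

    # Map each soft link in the directory to the real path it resolves to.
    links = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.islink(path):
            links[name] = os.path.realpath(path)

    return {
        "python": sys.executable,
        "directory": os.path.abspath(directory),
        "pythonpath": os.environ.get("PYTHONPATH", ""),
        "symlinks": links,
    }
```

Called from within a function passed to an RDD `map`, this would run in the Python worker on each executor instead of on the edge node, so the collected results should show the container-side values while the job is still running and before the directory is cleaned up. That placement is an assumption about how the probe would be wired into a job and is not shown here.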
【Discussion】:
Tags: apache-spark hadoop pyspark hadoop-yarn