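【Posted】: 2021-02-23 10:54:03
【Question】:
Is there a direct way to get the current job's YARN ApplicationId from the DRIVER node when running under Amazon's Elastic Map Reduce (EMR)? Spark is running in cluster mode.
Right now I use code that runs a map() operation on a worker to read the CONTAINER_ID environment variable. This seems inefficient. Here is the code: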
import os

def applicationIdFromEnvironment():
    # Build "application_<x>_<y>" from the 2nd and 3rd fields of CONTAINER_ID.
    return "_".join(['application'] + os.environ['CONTAINER_ID'].split("_")[1:3])

def applicationId():
    """Return the Yarn (or local) applicationID.
    The environment variables are only set if we are running in a Yarn container.
    """
    # First check to see if we are running on the worker...
    try:
        return applicationIdFromEnvironment()
    except KeyError:
        pass
    # Perhaps we are running on the driver? If so, run a Spark job that finds it.
    try:
        from pyspark import SparkContext
        sc = SparkContext.getOrCreate()
        if "local" in sc.getConf().get("spark.master"):
            return f"local{os.getpid()}"
        # Note: make sure that the following map does not require access to any existing module.
        appid = sc.parallelize([1]).map(lambda x: "_".join(['application'] + os.environ['CONTAINER_ID'].split("_")[1:3])).collect()
        return appid[0]
    except ImportError:
        pass
    # Application ID cannot be determined.
    return f"unknown{os.getpid()}"
【Comments】:
- Have you tried just using sc.applicationId? spark.apache.org/docs/latest/api/python/…
- I didn't know it was there! Please post this as an answer, @blackbishop, and I'll post my response.
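The suggestion in the comments could be sketched roughly as follows. This is a minimal sketch, not the asker's final code: the helper names `application_id_from_container_id` and `application_id` are hypothetical, and it assumes the caller passes in an existing SparkContext (for example from `SparkContext.getOrCreate()`). On the driver it reads the `applicationId` property directly, so no Spark job is needed; the CONTAINER_ID parsing from the question is kept only as a fallback for code that has no context.

```python
import os


def application_id_from_container_id(container_id):
    """Build 'application_<x>_<y>' from fields 2 and 3 of a container ID.

    Same parsing as in the question; the name is hypothetical.
    """
    return "_".join(["application"] + container_id.split("_")[1:3])


def application_id(sc=None):
    """Return the application ID, preferring the driver-side property.

    `sc` is assumed to be a pyspark SparkContext (or None when no context
    is available, e.g. inside a function running on a worker).
    """
    if sc is not None:
        # SparkContext.applicationId is a property; no job is launched.
        return sc.applicationId
    container_id = os.environ.get("CONTAINER_ID")
    if container_id is not None:
        # Fallback from the question: derive the ID from the YARN container.
        return application_id_from_container_id(container_id)
    # Application ID cannot be determined.
    return f"unknown{os.getpid()}"
```

On the driver this would be called as `application_id(SparkContext.getOrCreate())`. In local mode the property returns a local identifier rather than a YARN one, so the explicit "local" check from the question is no longer required.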
Tags: python apache-spark amazon-emr