【Question title】: Find the YARN ApplicationID of the current Spark job from the DRIVER node?
【Posted】: 2021-02-23 10:54:03
【Question】:

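Is there a direct way to get the YARN ApplicationId of the current job from the DRIVER node when running under Amazon's Elastic Map Reduce (EMR)? Spark is running in cluster mode.

Right now I'm using code that runs a map() operation on a worker just to read the CONTAINER_ID environment variable. Launching a Spark job only to look up an ID seems inefficient. Here's the code: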

import os


def applicationIdFromEnvironment():
    # Build "application_<timestamp>_<sequence>" from the CONTAINER_ID variable.
    return "_".join(['application'] + os.environ['CONTAINER_ID'].split("_")[1:3])

def applicationId():
    """Return the Yarn (or local) applicationID.
    The environment variables are only set if we are running in a Yarn container.
    """

    # First check to see if we are running on the worker...
    try:
        return applicationIdFromEnvironment()
    except KeyError:
        pass

    # Perhaps we are running on the driver? If so, run a Spark job that finds it.
    try:
        from pyspark import SparkContext
        sc = SparkContext.getOrCreate()
        if "local" in sc.getConf().get("spark.master"):
            return f"local{os.getpid()}"
        # Note: make sure that the following map does not require access to any existing module.
        appid = sc.parallelize([1]).map(lambda x: "_".join(['application'] + os.environ['CONTAINER_ID'].split("_")[1:3])).collect()
        return appid[0]
    except ImportError:
        pass

    # Application ID cannot be determined.
    return f"unknown{os.getpid()}"

【Comments】:

Tags: python apache-spark amazon-emr


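【Solution 1】:

You can get the application ID directly from the SparkContext through its applicationId property. The documentation describes it as:

A unique identifier for the Spark application. Its format depends on the scheduler implementation.

  • in case of a local Spark app, something like "local-1433865536131"

  • in case of YARN, something like "application_1433865536131_34483"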

appid = sc.applicationId
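
As a rough sketch of how this could replace the map() workaround from the question: the helper below prefers sc.applicationId when a SparkContext is available (on the driver) and only falls back to parsing CONTAINER_ID otherwise (for example inside a worker task). The function names and the `environ` parameter are hypothetical and exist only for illustration. The handling of an "e<number>" epoch field in the container ID is an assumption based on the form some Hadoop versions use (e.g. "container_e17_1410901177871_0001_01_000005"); check it against the IDs your cluster actually produces.

```python
import os


def app_id_from_container_id(container_id):
    """Derive 'application_<timestamp>_<sequence>' from a YARN container ID."""
    parts = container_id.split("_")
    # Assumption: some Hadoop versions insert an epoch field such as "e17"
    # right after "container"; drop it so the timestamp comes next.
    if len(parts) > 1 and parts[1].startswith("e") and parts[1][1:].isdigit():
        parts = parts[:1] + parts[2:]
    if len(parts) < 3 or parts[0] != "container":
        raise ValueError(f"unexpected container ID: {container_id!r}")
    return "_".join(["application"] + parts[1:3])


def current_application_id(sc=None, environ=None):
    """Return the application ID without launching a Spark job.

    sc      -- an existing SparkContext, if the caller has one (driver side)
    environ -- mapping to read CONTAINER_ID from; defaults to os.environ
    """
    if sc is not None:
        # Driver side: Spark already knows the ID.
        return sc.applicationId
    env = os.environ if environ is None else environ
    container_id = env.get("CONTAINER_ID")
    if container_id:
        # Inside a YARN container: reconstruct the ID from the environment.
        return app_id_from_container_id(container_id)
    # Same fallback as the question's code when nothing can be determined.
    return f"unknown{os.getpid()}"
```

On the driver this would be called as current_application_id(sc); inside a task, where no SparkContext exists, as current_application_id().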

【Comments】:
