【Question title】: Spark only runs my app with a single task
【Posted】: 2021-12-29 15:43:16
【Question description】:

I wrote a simple Spark application that runs on AWS EMR 6.4.0 and basically does this:

SparkConf sparkConf = new SparkConf().setAppName("MyAppName").setMaster("local");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
JavaRDD<String> dataSet = javaSparkContext.parallelize(_a_list_with_100_elements_);
// here I also tried to force 100 slices with .parallelize(_a_list_with_100_elements_, 100)

long count = dataSet.flatMap(....)
    .flatMap(...)
    .map(_something_that_outputs_0_or_1)
    .reduce(Integer::sum);

javaSparkContext.stop();

I run the application with the following command:

aws emr add-steps --profile myprofile --region us-east-1 --cluster-id j-SOMEID --steps Type=CUSTOM_JAR,Name=test-downloader,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=spark-submit,--class,com.my.main.MyClass,s3://somebucket/my.packaged.app-1.0.jar,-arg1,some,more,cli,args

But whether I run it locally or on a cluster with 10 hosts, I only see logs like these:

20:20:21.354 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  some log from my flatMap with element 0 from the list
20:20:21.789 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  some log from my flatMap with element 0 from the list
20:20:22.354 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  some log from my map with element 0 from the list
20:20:22.678 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  some log from my flatMap with element 1 from the list
20:20:23.975 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  some log from my flatMap with element 1 from the list
20:20:24.354 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  some log from my map with element 1 from the list
[...] more logs with the other elements, consecutively

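The logs always show task 0, and the application runs slowly, as if it were executing a single task, even though the cluster has 10 machines.

What am I doing wrong? How can I get it to run more work in parallel? Every map or flatMap returns one or more elements, so there is no shortage of work to do (except for the last map, which actually downloads something and returns 0 or 1 depending on whether the download succeeded).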

【Question comments】:

    Tags: apache-spark rdd


    【Solution 1】:

    For AWS EMR, you should use "yarn" as the master. Here is more information about what this means:

    So the line used to launch the EMR step looks like this:

    aws emr add-steps --profile yourProfile --region your_region --cluster-id CLUSTER_ID --steps Type=CUSTOM_JAR,Name=test-java,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=spark-submit,--class,com.your.MyClass,--master,yarn,--deploy-mode,client,s3://your_bucket/your_jar.jar,...application_params...
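
One caveat that may apply to the question's code: Spark's configuration documentation states that properties set directly on `SparkConf` take precedence over flags passed to `spark-submit`, which in turn take precedence over `spark-defaults.conf`. If that applies here, the hard-coded `.setMaster("local")` would still win over `--master yarn`, and `local` means a single worker thread. The sketch below is a toy model of those two documented rules, not Spark code: `MasterResolution`, `effectiveMaster`, and `localWorkerThreads` are hypothetical names introduced only for illustration.

```java
// Toy model only: these helpers are hypothetical and are NOT part of the Spark API.
final class MasterResolution {

    // Models the documented precedence: a value set on SparkConf in code wins,
    // then the spark-submit --master flag, then spark-defaults.conf.
    // Pass null for a source that does not set a master.
    static String effectiveMaster(String setInCode, String submitFlag, String defaultsFile) {
        if (setInCode != null) {
            return setInCode;
        }
        if (submitFlag != null) {
            return submitFlag;
        }
        return defaultsFile; // may be null if no source set a master
    }

    // Models the documented meaning of the basic local master strings:
    //   "local"    -> 1 worker thread
    //   "local[K]" -> K worker threads
    //   "local[*]" -> one worker thread per logical core
    // Returns -1 for anything else (for example "yarn"), where the cluster
    // manager decides how much runs in parallel.
    static int localWorkerThreads(String master, int availableCores) {
        if (master.equals("local")) {
            return 1;
        }
        if (master.equals("local[*]")) {
            return availableCores;
        }
        if (master.matches("local\\[\\d+\\]")) {
            return Integer.parseInt(master.substring(6, master.length() - 1));
        }
        return -1;
    }

    public static void main(String[] args) {
        // The question's situation: "local" hard-coded, "yarn" passed on the command line.
        String master = effectiveMaster("local", "yarn", null);
        System.out.println(master + " -> " + localWorkerThreads(master, 8) + " worker thread(s)");
        // prints: local -> 1 worker thread(s)
    }
}
```

Under this reading, the question's driver would also need to drop the `.setMaster("local")` call (leaving `new SparkConf().setAppName("MyAppName")`) so that the `--master yarn` flag in the step definition can take effect.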
    

    【Comments】:
