Spark 仅使用单个任务运行我的应用程序答案

【问题标题】：Spark only runs my app with a single taskSpark 仅使用单个任务运行我的应用程序
【发布时间】：2021-12-29 15:43:16
【问题描述】：

我在 AWS EMR 6.4.0 之上编写了一个简单的 Spark 应用程序，它基本上是这样做的：

SparkConf sparkConf = new SparkConf().setAppName("MyAppName").setMaster("local");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
JavaRDD<String> dataSet = javaSparkContext.parallelize(_a_list_with_100_elements_);
// here I also tried to force 100 slices with .parallelize(_a_list_with_100_elements_, 100)

long count = dataSet.flatMap(....)
    .flatMap(...)
    .map(_something_that_outputs_0_or_1)
    .reduce(Integer::sum);

javaSparkContext.stop();

我正在使用以下命令运行应用程序：

aws emr add-steps --profile myprofile --region us-east-1 --cluster-id j-SOMEID --steps Type=CUSTOM_JAR,Name=test-downloader,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=spark-submit,--class,com.my.main.MyClass,s3://somebucket/my.packaged.app-1.0.jar,-arg1,some,more,cli,args

但是无论是在本地还是在有 10 个主机的集群中，我都只能看到这样的日志：

20:20:21.354 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  some log from my flatMap with element 0 from the list
20:20:21.789 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  some log from my flatMap with element 0 from the list
20:20:22.354 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  some log from my map with element 0 from the list
20:20:22.678 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  some log from my flatMap with element 1 from the list
20:20:23.975 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  some log from my flatMap with element 1 from the list
20:20:24.354 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  some log from my map with element 1 from the list
[...] more logs with the other elements, consecutively

我总是在日志中看到任务 0，并且应用程序运行缓慢，就像它运行单个任务一样，即使我在集群中有 10 台机器。

我做错了什么？我怎样才能让它并行运行更多的东西？每张地图或平面地图都返回一个或多个元素，因此它不会没有事情可做（除了最后一张地图实际上正在下载某些东西并根据它是否成功返回 0 或 1）。

【问题讨论】：

标签： apache-spark rdd

【解决方案1】：

对于 AWS EMR，您应该使用“yarn”作为主节点。这里有更多关于这意味着什么的信息：

因此，用于启动 EMR 步骤的行如下所示：

aws emr add-steps --profile yourProfile --region your_region --cluster-id CLUSTER_ID --steps Type=CUSTOM_JAR,Name=test-java,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=spark-submit,--class,com.your.MyClass,--master,yarn,--deploy-mode,client,s3://your_bucket/your_jar.jar,...application_params...

【讨论】：