【发布时间】:2021-12-29 15:43:16
【问题描述】:
我在 AWS EMR 6.4.0 之上编写了一个简单的 Spark 应用程序,它基本上是这样做的:
SparkConf sparkConf = new SparkConf().setAppName("MyAppName").setMaster("local");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
JavaRDD<String> dataSet = javaSparkContext.parallelize(_a_list_with_100_elements_);
// here I also tried to force 100 slices with .parallelize(_a_list_with_100_elements_, 100)
long count = dataSet.flatMap(....)
.flatMap(...)
.map(_something_that_outputs_0_or_1)
.reduce(Integer::sum);
javaSparkContext.stop();
我正在使用以下命令运行应用程序:
aws emr add-steps --profile myprofile --region us-east-1 --cluster-id j-SOMEID --steps Type=CUSTOM_JAR,Name=test-downloader,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=spark-submit,--class,com.my.main.MyClass,s3://somebucket/my.packaged.app-1.0.jar,-arg1,some,more,cli,args
但是无论是在本地还是在有 10 个主机的集群中,我都只能看到这样的日志:
20:20:21.354 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO some log from my flatMap with element 0 from the list
20:20:21.789 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO some log from my flatMap with element 0 from the list
20:20:22.354 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO some log from my map with element 0 from the list
20:20:22.678 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO some log from my flatMap with element 1 from the list
20:20:23.975 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO some log from my flatMap with element 1 from the list
20:20:24.354 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO some log from my map with element 1 from the list
[...] more logs with the other elements, consecutively
我总是在日志中看到任务 0,并且应用程序运行缓慢,就像它运行单个任务一样,即使我在集群中有 10 台机器。
我做错了什么?我怎样才能让它并行运行更多的东西?每张地图或平面地图都返回一个或多个元素,因此它不会没有事情可做(除了最后一张地图实际上正在下载某些东西并根据它是否成功返回 0 或 1)。
【问题讨论】:
标签: apache-spark rdd