【发布时间】:2018-07-25 23:22:06
【问题描述】:
我正在尝试从我的工作站(在 Intellij 内部)运行应用程序并连接到在 ec2 上运行的远程 Spark 集群 (2.3.1)。我知道这不是最佳做法,但如果我能将其用于开发,那会让我的生活变得更轻松。
我已经取得了相当大的进展,我能够在 RDD 上运行操作并返回结果,直到我到达使用 .zipWithIndex() 的步骤并且我得到以下异常:
ERROR 2018-07-19 11:16:21,137 o.a.spark.network.shuffle.RetryingBlockFetcher Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /172.x.x.x:33898
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245) ~[spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187) ~[spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:113) ~[spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:121) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:123) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:98) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:691) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62) [spark-combined-shaded-2.3.1-evg1.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_172]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_172]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_172]
其中 172.x.x.x 是包含 master 和 worker 的 spark 实例的 AWS VPC 内的(经过审查的)本地 IP。我已经配置了 ec2 Spark 实例,以便它应该使用带有 SPARK_PUBLIC_DNS 的公共 DNS,并使用以下配置来构建我的 SparkContext:
SparkConf sparkConf = new SparkConf()
.setAppName("myapp")
.setMaster(System.getProperty("spark.master", "spark://ec2-x-x-x-x.compute-1.amazonaws.com:7077"))
.set("spark.cores.max", String.valueOf(4))
.set("spark.scheduler.mode", "FAIR")
.set("spark.driver.maxResultSize", String.valueOf(maxResultSize))
.set("spark.executor.memory", "2G")
.set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
.set("spark.ui.retainedStages", String.valueOf(250))
.set("spark.ui.retainedJobs", String.valueOf(250))
.set("spark.network.timeout", String.valueOf(800))
.set("spark.driver.host", "localhost")
.set("spark.driver.port", String.valueOf(23584))
.set("spark.driver.blockManager.port", String.valueOf(6578))
.set("spark.files.overwrite", "true")
;
SparkSession spark = SparkSession.builder().config(sparkConf).getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
jsc.addJar("my_application.jar");
然后我创建一个 SSH 隧道
ssh -R 23584:localhost:23584 -L 44895:localhost:44895 -R 27017:localhost:27017 -R 6578:localhost:6578 ubuntu@ec2-x-x-x-x.compute-1.amazonaws.com
以便工人可以看到我的机器。我错过了什么?为什么仍然尝试通过它的 AWS IP 连接到我的机器上看不到的东西?
编辑:当我查看 Web UI 时,我可以看到 java.io.IOException: Failed to connect to /172.x.x.x:33898 中引用的端口确实属于执行程序。如何告诉我的驱动程序通过公共 IP 而不是私有 IP 进行连接?
【问题讨论】:
标签: amazon-web-services apache-spark amazon-ec2