【发布时间】:2016-12-13 01:01:42
【问题描述】:
当我尝试向 google dataproc 集群提交 Spark 流式传输作业时,出现以下异常:
16/12/13 00:44:20 ERROR org.apache.spark.SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/tmp/0afbad25-cb65-49f1-87b8-9cf6523512dd/skyfall-assembly-0.0.1.jar does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
...
16/12/13 00:44:20 INFO org.spark_project.jetty.server.ServerConnector: Stopped ServerConnector@d7bffbc{HTTP/1.1}{0.0.0.0:4040}
16/12/13 00:44:20 WARN org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
16/12/13 00:44:20 ERROR org.apache.spark.util.Utils: Uncaught exception in thread main
java.lang.NullPointerException
at org.apache.spark.network.shuffle.ExternalShuffleClient.close(ExternalShuffleClient.java:152)
at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1360)
...
Exception in thread "main" java.io.FileNotFoundException: File file:/tmp/0afbad25-cb65-49f1-87b8-9cf6523512dd/skyfall-assembly-0.0.1.jar does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
完整输出here。
似乎在 spark-env.sh 中未正确定义 hadoop 配置时会发生此错误 - link1, link2
它可以在某个地方配置吗?关于如何解决它的任何指示?
在本地模式下运行相同的代码可以正常工作:
sparkConf.setMaster("local[4]")
对于其他上下文:作业是这样调用的:
gcloud dataproc jobs submit spark \
--cluster my-test-cluster \
--class com.company.skyfall.Skyfall \
--jars gs://my-bucket/resources/skyfall-assembly-0.0.1.jar \
--properties spark.ui.showConsoleProgress=false
这是样板设置代码:
lazy val conf = {
val c = new SparkConf().setAppName(this.getClass.getName)
c.set("spark.ui.port", (4040 + scala.util.Random.nextInt(1000)).toString)
if (isLocal) c.setMaster("local[4]")
c.set("spark.streaming.receiver.writeAheadLog.enable", "true")
c.set("spark.streaming.blockInterval", "1s")
}
lazy val ssc = if (checkPointingEnabled) {
StreamingContext.getOrCreate(getCheckPointDirectory, createStreamingContext)
} else {
createStreamingContext()
}
private def getCheckPointDirectory: String = {
if (isLocal) localCheckPointPath else checkPointPath
}
private def createStreamingContext(): StreamingContext = {
val s = new StreamingContext(conf, Seconds(batchDurationSeconds))
s.checkpoint(getCheckPointDirectory)
s
}
提前致谢
【问题讨论】:
标签: apache-spark google-cloud-dataproc