【发布时间】:2018-03-02 17:18:52
【问题描述】:
我有一个简单的字数统计程序,打包为object:
object MyApp {
val path = "file:///home/sergey/spark/spark-2.2.0/README.md"
val readMe = sc.textFile(path)
val stop = List("to","the","a")
val res = (readMe
.flatMap(_.split("\\W+"))
.filter(_.length > 0)
.map(_.toLowerCase)
.filter(!stop.contains(_))
.map((_, 1))
.reduceByKey(_ + _)
.sortBy(-_._2)
)
println(res.take(3).mkString)
}
当我尝试执行它时,我得到:
scala> MyApp
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:387)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:386)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.filter(RDD.scala:386)
... 51 elided
Caused by: java.io.NotSerializableException: MyApp$
Serialization stack:
- object not serializable (class: MyApp$, value: MyApp$@7bd44868)
- field (class: MyApp$$anonfun$5, name: $outer, type: class MyApp$)
- object (class MyApp$$anonfun$5, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
(罪魁祸首是.filter(!stop.contains(_))线。
但是,当我逐行执行相同的代码时,它运行良好并产生了预期的结果。
非常感谢您回答 2 个问题:
逐行执行和单例执行之间有什么不同,以至于一个运行而另一个失败?
除了将
!stop.contains(_)闭包和stop列表打包到另一个对象中之外,还有什么其他解决方案?
【问题讨论】:
-
您是否尝试将所有代码都放在一个主目录中?或者也许将
extends App添加到您的对象定义中? -
@philantrovert 我正在执行此代码
as-isinspark-shell
标签: scala apache-spark serialization closures