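【Posted】: 2016-12-19 20:39:45
【Question】:
I recently upgraded to Spark 2.0, and I'm seeing some strange behavior when trying to create a simple Dataset from JSON strings. Here is a simple test case: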
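import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("test").master("local[1]").getOrCreate();
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
JavaRDD<String> rdd = sc.parallelize(Arrays.asList(
    "{\"name\":\"tom\",\"title\":\"engineer\",\"roles\":[\"designer\",\"developer\"]}",
    "{\"name\":\"jack\",\"title\":\"cto\",\"roles\":[\"designer\",\"manager\"]}"
));
// Identity map that logs every record it processes
JavaRDD<String> mappedRdd = rdd.map(json -> {
    System.out.println("mapping json: " + json);
    return json;
});
// No schema is supplied to the reader here
Dataset<Row> data = spark.read().json(mappedRdd);
data.show();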
And the output:
mapping json: {"name":"tom","title":"engineer","roles":["designer","developer"]}
mapping json: {"name":"jack","title":"cto","roles":["designer","manager"]}
mapping json: {"name":"tom","title":"engineer","roles":["designer","developer"]}
mapping json: {"name":"jack","title":"cto","roles":["designer","manager"]}
+----+--------------------+--------+
|name|               roles|   title|
+----+--------------------+--------+
| tom|[designer, develo...|engineer|
|jack| [designer, manager]|     cto|
+----+--------------------+--------+
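The "map" function appears to be executed twice, even though I only perform one action. I expected Spark to lazily build an execution plan and then execute it when needed, but it seems that in order to read the data as JSON and do anything with it, the plan has to be executed at least twice.
In this simple case it doesn't matter, but when the map function is long-running, this becomes a big problem. Is this correct, or am I missing something?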
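One possible explanation, offered as an assumption rather than a confirmed diagnosis: the Spark documentation for `DataFrameReader.json` notes that, when no schema is given, the reader goes through the input once to determine the schema, which would account for one extra pass over `mappedRdd`. The sketch below is a Spark-free model of that idea using only the Java standard library. The class `LazyPassDemo`, its methods, and the two-element record list are hypothetical stand-ins, not Spark APIs; it only shows that a lazily defined map step runs once per pass that consumes it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyPassDemo {
    // Hypothetical stand-in for the two JSON records in the question.
    static final List<String> RECORDS = Arrays.asList("tom", "jack");

    // A lazy pipeline: the map step is re-run every time the supplier is consumed.
    static Supplier<Stream<String>> lazyMapped(AtomicInteger mapCalls) {
        return () -> RECORDS.stream().map(r -> {
            mapCalls.incrementAndGet();
            return r;
        });
    }

    // Two passes over the lazy pipeline: one to inspect records, one to read them.
    static int mapCallsWithInference() {
        AtomicInteger calls = new AtomicInteger();
        Supplier<Stream<String>> mapped = lazyMapped(calls);
        mapped.get().forEach(r -> { });            // pass 1: models schema inference
        mapped.get().collect(Collectors.toList()); // pass 2: models reading the rows
        return calls.get();
    }

    // One pass: models the case where the schema is known in advance.
    static int mapCallsWithKnownSchema() {
        AtomicInteger calls = new AtomicInteger();
        lazyMapped(calls).get().collect(Collectors.toList());
        return calls.get();
    }

    // Materialize once, then let both passes read the cached result.
    static int mapCallsWithCache() {
        AtomicInteger calls = new AtomicInteger();
        List<String> cached = lazyMapped(calls).get().collect(Collectors.toList());
        cached.forEach(r -> { });                       // pass 1 reads the cache
        cached.stream().collect(Collectors.toList());   // pass 2 reads the cache
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println("with inference:    " + mapCallsWithInference());
        System.out.println("with known schema: " + mapCallsWithKnownSchema());
        System.out.println("with cache:        " + mapCallsWithCache());
    }
}
```

If this model matches what Spark is doing, the corresponding things to try in the test case above would be supplying a schema through `spark.read().schema(...)` before calling `json(mappedRdd)`, or calling `mappedRdd.cache()` so that the second pass reuses the mapped records instead of recomputing them.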
【Comments】:
Tags: java apache-spark apache-spark-sql