【Question title】: Non-Deterministic Behaviour of UNION of RDDs in Spark
【Posted】: 2020-09-27 22:30:50
【Question】:

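I am performing a union of 3 RDDs. I know that union does not preserve ordering, but the behaviour I see in my case is strange. Can someone explain what is wrong with my code?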

I have a DataFrame of rows (myDF) that I convert to an RDD:

val myRdd = myDF.rdd.map(row => row.toSeq.toList.mkString(":")).map(rec => (2, rec))

myRdd.collect
/*
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
*/

val rowCount = myRdd.count() // Count of Records in myRdd

val header = "name:country:date:nextdate:1" // random header

// Generating Header Rdd
val headerRdd = sparkContext.parallelize(Array(header), 1).map(rec => (1, rec))

//Generating Trailer Rdd
val trailerRdd = sparkContext.parallelize(Array("T" + ":" + rowCount),1).map(rec => (3, rec))

//Performing Union
val unionRdd = headerRdd.union(myRdd).union(trailerRdd).map(rec => rec._2)
unionRdd.saveAsTextFile("pathLocation")

Since union does not preserve ordering, I would not expect it to give the result below.

Output

name:country:date:nextdate:1
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3

How is it possible to get the output above without using any sorting, such as

sortByKey(true, 1)

But when I remove the map from headerRdd, myRdd and trailerRdd, the order looks like this:

Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
name:country:date:nextdate:1
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3

What could be the reason for the behaviour above?

【Question comments】:

    Tags: scala sorting apache-spark union rdd


    【Solution 1】:

    In Spark, the elements within a particular partition are unordered, but the partitions themselves are ordered; check this.
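The partition ordering described above can be checked on a small example. What follows is only a minimal sketch, assuming a local SparkSession; the names `UnionOrderSketch` and `buildUnion` and the shortened body records are hypothetical stand-ins, not code from the question. It relies on `union` keeping the partitions of its inputs in the order the inputs were given (when none of them has a partitioner), so the header partition gets the lowest index and the trailer partition the highest.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object UnionOrderSketch {

  // Builds header, body and trailer RDDs from hypothetical sample records
  // and unions them in that order.
  def buildUnion(sc: SparkContext): RDD[String] = {
    val headerRdd  = sc.parallelize(Seq("name:country:date:nextdate:1"), 1)
    val bodyRdd    = sc.parallelize(Seq("Deepak", "Veeru", "Rajeev"), 2)
    val trailerRdd = sc.parallelize(Seq("T:3"), 1)
    headerRdd.union(bodyRdd).union(trailerRdd)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used purely for illustration.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("union-order-sketch")
      .getOrCreate()

    val unionRdd = buildUnion(spark.sparkContext)

    // The union keeps every parent partition: 1 + 2 + 1 partitions here.
    println(s"partitions: ${unionRdd.getNumPartitions}")

    // Tag each record with the index of the partition it lives in,
    // so the partition layout of the union becomes visible.
    unionRdd
      .mapPartitionsWithIndex((idx, it) => it.map(rec => s"partition $idx: $rec"))
      .collect()
      .foreach(println)

    spark.stop()
  }
}
```

In this sketch the header-first, trailer-last order falls out of the partition layout rather than from any sort. If the order matters, stating it explicitly (for example with the numeric keys and `sortByKey` mentioned in the question) is safer than relying on how the input RDDs happen to be partitioned.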

    【Discussion】:

    • When I removed the map from the RDDs, it did not work:
      val headerRdd = sparkContext.parallelize(Array(header), 1)
      val trailerRdd = sparkContext.parallelize(Array("T" + ":" + rowCount), 1)
      val myRdd = myDF.rdd.map(row => row.toSeq.toList.mkString(":"))
      val unionRdd = headerRdd.union(myRdd).union(trailerRdd).saveAsTextFile("pathLocation")