A Beginner's Path Through the Spark Source Code - 8: RDD

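The previous post studied the shuffle process. A shuffle arises from the dependency relationships between RDDs: whenever a wide dependency exists between two RDDs, a shuffle takes place.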

RDD is the most important data abstraction in Spark. Starting with this post, we will study the implementation details of Spark's RDD.

1. Overview

/**
 * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
 * partitioned collection of elements that can be operated on in parallel. This class contains the
 * basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
 * [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
 * pairs, such as `groupByKey` and `join`;
 * [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
 * Doubles; and
 * [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
 * can be saved as SequenceFiles.
 * All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)])
 * through implicit.
 *
 * Internally, each RDD is characterized by five main properties:
 *  (Annotation: of the five properties listed below, the first three are always present and
 *  the last two are optional.)
 *  - A list of partitions
 *  - A function for computing each split
 *  - A list of dependencies on other RDDs
 *  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
 *  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
 *    an HDFS file)
 *
 * All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
 * to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for
 * reading data from a new storage system) by overriding these functions. Please refer to the
 * <a href="http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf">Spark paper</a>
 * for more details on RDD internals.
 */
abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging
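
To make the five properties concrete, here is a minimal toy model in plain Scala. It is only a sketch: ToyRDD, ToyPartition and ToyCollectionRDD are hypothetical names invented for illustration, not Spark classes, and the partitioner is reduced to a plain function. It shows how a partition list plus a per-split compute function is already enough to drive a collect-style action.

```scala
object FivePropertiesSketch {
  // Toy model only: these names are hypothetical and are not Spark's classes.
  final case class ToyPartition(index: Int)

  abstract class ToyRDD[T] {
    // 1. A list of partitions
    def partitions: Seq[ToyPartition]
    // 2. A function for computing each split
    def compute(split: ToyPartition): Iterator[T]
    // 3. A list of dependencies on other RDDs (empty for a source RDD)
    def dependencies: Seq[ToyRDD[_]] = Nil
    // 4. Optionally, a partitioner (modelled here as a key => partition index function)
    def partitioner: Option[Any => Int] = None
    // 5. Optionally, preferred locations for computing each split
    def preferredLocations(split: ToyPartition): Seq[String] = Nil

    // A collect-style action: compute every partition and concatenate the results.
    def collect(): Seq[T] = partitions.flatMap(p => compute(p).toSeq)
  }

  // A source RDD backed by local data that has already been cut into slices.
  class ToyCollectionRDD[T](slices: Vector[Seq[T]]) extends ToyRDD[T] {
    def partitions: Seq[ToyPartition] = slices.indices.map(i => ToyPartition(i))
    def compute(split: ToyPartition): Iterator[T] = slices(split.index).iterator
  }

  val toy = new ToyCollectionRDD(Vector(Seq(1, 2), Seq(3, 4)))
}
```

Only partitions and compute have to be supplied by a subclass in this sketch; the other three members have defaults, which mirrors the "optionally" wording in the comment above.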

This class wraps many operations we are already familiar with, such as map, filter, and persist.


It also has a large and varied set of subclasses.


Most of these RDD subclasses exist to add features specific to that kind of RDD.
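
As a rough illustration of what "adding specific features" can look like, the sketch below derives a mapped RDD from a source RDD. All names (MiniRDD, LocalMiniRDD, MappedMiniRDD) are hypothetical stand-ins, and partitions are reduced to plain integer indices. The derived class reuses its parent's partitioning, wraps the parent's iterator in compute, and records the parent as its only dependency.

```scala
object SubclassSketch {
  // Hypothetical minimal base class standing in for RDD.
  abstract class MiniRDD[T] {
    def numPartitions: Int
    def compute(split: Int): Iterator[T]
    // Parents of this RDD; a source RDD has none.
    def parents: Seq[MiniRDD[_]] = Nil
    // A transformation only builds a new object; nothing is computed yet.
    def map[U](f: T => U): MiniRDD[U] = new MappedMiniRDD(this, f)
    def collect(): Seq[T] = (0 until numPartitions).flatMap(i => compute(i).toSeq)
  }

  // Source RDD: its "special feature" is holding local, pre-sliced data.
  class LocalMiniRDD[T](slices: Vector[Seq[T]]) extends MiniRDD[T] {
    def numPartitions: Int = slices.length
    def compute(split: Int): Iterator[T] = slices(split).iterator
  }

  // Derived RDD: same partitioning as the parent, one parent partition per split.
  class MappedMiniRDD[T, U](prev: MiniRDD[T], f: T => U) extends MiniRDD[U] {
    def numPartitions: Int = prev.numPartitions
    def compute(split: Int): Iterator[U] = prev.compute(split).map(f)
    override def parents: Seq[MiniRDD[_]] = Seq(prev)
  }

  val source = new LocalMiniRDD(Vector(Seq(1, 2), Seq(3, 4)))
  val doubled = source.map(_ * 2)
}
```

Because each split of the mapped RDD reads exactly one split of its parent, this corresponds to a narrow dependency, so no shuffle would be involved.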

2. Notes on the RDD creation process

Let's go back to our local test demo:

val conf = new SparkConf().setAppName("myApp").setMaster("local[2]")
val spark = new SparkContext(conf)
val data = Array(1, 2, 3, 4)
val disData = spark.parallelize(data)

We called the parallelize method to create an RDD:

/** Distribute a local Scala collection to form an RDD.
  *
  * @note Parallelize acts lazily. If `seq` is a mutable collection and is altered after the call
  *       to parallelize and before the first action on the RDD, the resultant RDD will reflect the
  *       modified collection. Pass a copy of the argument to avoid this.
  * @note avoid using `parallelize(Seq())` to create an empty `RDD`. Consider `emptyRDD` for an
  *       RDD with no partitions, or `parallelize(Seq[T]())` for an RDD of `T` with empty partitions.
  * @param seq       Scala collection to distribute
  * @param numSlices number of partitions to divide the collection into
  * @return RDD representing distributed collection
  */
def parallelize[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  assertNotStopped()
  // The RDD is created here
  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}
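
The numSlices parameter decides how the local collection is cut into partitions. The sketch below shows one way to do that with contiguous, near-equal index ranges. It is modelled on the index arithmetic as I recall it from ParallelCollectionRDD, so treat the details as an assumption and check them against your Spark version; the real code also has special handling for ranges, which is left out here. SliceSketch and slice are names made up for this example.

```scala
object SliceSketch {
  // Cut `seq` into `numSlices` contiguous ranges whose sizes differ by at most one.
  def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "Positive number of partitions required")
    val length = seq.length
    (0 until numSlices).map { i =>
      // Long arithmetic avoids Int overflow for large collections.
      val start = ((i * length.toLong) / numSlices).toInt
      val end = (((i + 1) * length.toLong) / numSlices).toInt
      seq.slice(start, end)
    }
  }
}
```

With the demo data Array(1, 2, 3, 4) and two slices, this arithmetic yields the ranges [0, 2) and [2, 4), that is, two partitions of two elements each.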

RDDOperationScope is used to track the history of RDD operations and how they relate to one another. It consists of a class and a companion object:


/**
 * A general, named code block representing an operation that instantiates RDDs.
 * (Annotation: in other words, a scope is a general, named block of code whose job is to
 * instantiate RDDs.)
 * All RDDs instantiated in the corresponding code block will store a pointer to this object.
 * Examples include, but will not be limited to, existing RDD operations, such as textFile,
 * reduceByKey, and treeAggregate.
 * (Annotation: scopes can be nested inside one another.)
 * An operation scope may be nested in other scopes. For instance, a SQL query may enclose
 * scopes associated with the public RDD APIs it uses under the hood.
 *
 * There is no particular relationship between an operation scope and a stage or a job.
 * A scope may live inside one stage (e.g. map) or span across multiple jobs (e.g. take).
 */
@JsonInclude(Include.NON_NULL)
@JsonPropertyOrder(Array("id", "name", "parent"))
private[spark] class RDDOperationScope(
    val name: String,
    val parent: Option[RDDOperationScope] = None,
    val id: String = RDDOperationScope.nextScopeId().toString)

/**
 * A collection of utility methods to construct a hierarchical representation of RDD scopes.
 * An RDD scope tracks the series of operations that created a given RDD.
 */
private[spark] object RDDOperationScope extends Logging
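
The name, parent and id fields form a simple linked hierarchy. The sketch below models it with a hypothetical ToyScope case class; the allScopes and path helpers, and the counter used for ids, are my own illustration rather than Spark's API. It shows how following the parent pointers recovers the chain of operations that enclose a given scope.

```scala
import java.util.concurrent.atomic.AtomicInteger

object ScopeChainSketch {
  // Hypothetical id generator, playing the role of nextScopeId().
  private val counter = new AtomicInteger(0)

  final case class ToyScope(
      name: String,
      parent: Option[ToyScope] = None,
      id: String = counter.getAndIncrement().toString) {
    // All enclosing scopes, root first, ending with this scope.
    def allScopes: Seq[ToyScope] =
      parent.map(_.allScopes).getOrElse(Seq.empty) :+ this

    // Human-readable path such as "sqlQuery > textFile".
    def path: String = allScopes.map(_.name).mkString(" > ")
  }

  // A SQL query scope enclosing the RDD operation it uses under the hood.
  val sql = ToyScope("sqlQuery")
  val read = ToyScope("textFile", Some(sql))
}
```

Each scope gets a distinct id by default, so two operations with the same name can still be told apart.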

The withScope wrapper seen in parallelize ends up calling the RDDOperationScope.withScope method:

/**
 * Execute the given body such that all RDDs created in this body will have the same scope.
 * The name of the scope will be the first method name in the stack trace that is not the
 * same as this method's.
 *
 * Note: Return statements are NOT allowed in body.
 */
private[spark] def withScope[T](
    sc: SparkContext,
    allowNesting: Boolean = false)(body: => T): T = {
  val ourMethodName = "withScope"
  // Get the current thread's method call stack
  val callerMethodName = Thread.currentThread.getStackTrace()
    .dropWhile(_.getMethodName != ourMethodName)
    .find(_.getMethodName != ourMethodName)
    .map(_.getMethodName)
    .getOrElse {
      // Log a warning just in case, but this should almost certainly never happen
      logWarning("No valid method name for this RDD operation scope!")
      "N/A"
    }
  withScope[T](sc, callerMethodName, allowNesting, ignoreParent = false)(body)
}
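
The stack-trace lookup is easier to follow in isolation. The sketch below reproduces the same dropWhile / find chain in a standalone object; CallerNameSketch, callerOf and loadData are hypothetical names, and the body receives the detected name only so that the example can return it. The dropWhile skips frames above the first withScope frame (such as getStackTrace itself), and the find then skips every consecutive withScope frame, which matters in Spark because the overloads share one method name.

```scala
object CallerNameSketch {
  // Name of the first frame, below the `ourMethodName` frames, that has a different name.
  def callerOf(ourMethodName: String): String =
    Thread.currentThread.getStackTrace
      .dropWhile(_.getMethodName != ourMethodName) // skip getStackTrace, callerOf, ...
      .find(_.getMethodName != ourMethodName)      // skip the withScope frame(s)
      .map(_.getMethodName)
      .getOrElse("N/A")

  // Hypothetical stand-in for withScope: hands the detected scope name to the body.
  def withScope[T](body: String => T): T = body(callerOf("withScope"))

  // An "RDD operation" that opens a scope; the scope is named after this method.
  def loadData(): String = withScope { scopeName => scopeName }
}
```

This is why operations such as parallelize never pass their own name explicitly: the scope name is recovered from the call stack.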

/**
 * Execute the given body such that all RDDs created in this body will have the same scope.
 *
 * If nesting is allowed, any subsequent calls to this method in the given body will instantiate
 * child scopes that are nested within our scope. Otherwise, these calls will take no effect.
 *
 * Additionally, the caller of this method may optionally ignore the configurations and scopes
 * set by the higher level caller. In this case, this method will ignore the parent caller's
 * intention to disallow nesting, and the new scope instantiated will not have a parent. This
 * is useful for scoping physical operations in Spark SQL, for instance.
 *
 * Note: Return statements are NOT allowed in body.
 */
private[spark] def withScope[T](
    sc: SparkContext,
    name: String,
    allowNesting: Boolean,
    ignoreParent: Boolean)(body: => T): T = {
  // Save the old scope to restore it later
  val scopeKey = SparkContext.RDD_SCOPE_KEY
  val noOverrideKey = SparkContext.RDD_SCOPE_NO_OVERRIDE_KEY
  val oldScopeJson = sc.getLocalProperty(scopeKey)
  val oldScope = Option(oldScopeJson).map(RDDOperationScope.fromJson)
  val oldNoOverride = sc.getLocalProperty(noOverrideKey)
  try {
    if (ignoreParent) {
      // Ignore all parent settings and scopes and start afresh with our own root scope
      sc.setLocalProperty(scopeKey, new RDDOperationScope(name).toJson)
    } else if (sc.getLocalProperty(noOverrideKey) == null) {
      // Otherwise, set the scope only if the higher level caller allows us to do so
      sc.setLocalProperty(scopeKey, new RDDOperationScope(name, oldScope).toJson)
    }
    // Optionally disallow the child body to override our scope
    if (!allowNesting) {
      sc.setLocalProperty(noOverrideKey, "true")
    }
    body
  } finally {
    // Remember to restore any state that was modified before exiting
    sc.setLocalProperty(scopeKey, oldScopeJson)
    sc.setLocalProperty(noOverrideKey, oldNoOverride)
  }
}
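
The save, override, and restore logic can be tried out without a SparkContext. The sketch below is a simplified model under several assumptions: a mutable map stands in for the context's local properties, a scope is stored as a plain path string instead of JSON, and the key names are made up. It keeps the same three rules: ignoreParent starts a fresh root scope, a scope is only set when no enclosing caller has forbidden overriding, and the previous state is always restored in the finally block.

```scala
import scala.collection.mutable

object ScopeRestoreSketch {
  // Stand-in for SparkContext local properties (hypothetical keys).
  val props = mutable.Map.empty[String, String]
  val ScopeKey = "scope"
  val NoOverrideKey = "noOverride"

  // Put back the saved value, or clear the key if it was unset before.
  private def restore(key: String, old: Option[String]): Unit = old match {
    case Some(v) => props(key) = v
    case None => props.remove(key); ()
  }

  def withScope[T](
      name: String,
      allowNesting: Boolean = false,
      ignoreParent: Boolean = false)(body: => T): T = {
    // Save the old state to restore it later
    val oldScope = props.get(ScopeKey)
    val oldNoOverride = props.get(NoOverrideKey)
    try {
      if (ignoreParent) {
        // Start afresh with our own root scope
        props(ScopeKey) = name
      } else if (!props.contains(NoOverrideKey)) {
        // Nest under the current scope only if the enclosing caller allows it
        props(ScopeKey) = oldScope.map(_ + "/").getOrElse("") + name
      }
      // Optionally forbid the body from overriding our scope
      if (!allowNesting) props(NoOverrideKey) = "true"
      body
    } finally {
      restore(ScopeKey, oldScope)
      restore(NoOverrideKey, oldNoOverride)
    }
  }

  def currentScope: Option[String] = props.get(ScopeKey)
}
```

With the default allowNesting = false, an inner withScope("map") called inside withScope("textFile") leaves the scope as "textFile", so every RDD created by one public operation is attributed to that operation.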

Through the methods in RDDOperationScope, the operations performed on an RDD can be traced.