1. RDD

From the Spark Scaladoc: a Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. The RDD class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value pairs, such as groupByKey and join; [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of Doubles; and [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that can be saved as SequenceFiles. All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions.

RDD: immutable, partitioned, computed in parallel.

Single-machine storage ===> distributed storage/computation
1) Data storage: split into blocks (HDFS Blocks)
2) Data computation: split into tasks (MapReduce / Spark)
3) Storage + computation: HDFS/S3 + MapReduce/Spark
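The split-then-compute idea can be sketched in plain Python. This is a conceptual sketch only: the helper names and the block size are made up for illustration, not HDFS defaults or Spark API.

```python
# Sketch: "split the storage, split the computation" (illustrative only).
def split_into_blocks(data, block_size):
    """Storage side: cut the dataset into fixed-size blocks (like HDFS Blocks)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def compute_on_blocks(blocks, fn):
    """Compute side: run the same function on every block (like MapReduce/Spark tasks)."""
    return [fn(block) for block in blocks]

data = list(range(10))
blocks = split_into_blocks(data, block_size=4)   # [[0,1,2,3], [4,5,6,7], [8,9]]
partials = compute_on_blocks(blocks, sum)        # [6, 22, 17]
total = sum(partials)                            # 45
```

Each block can be processed independently, which is what makes the per-block step trivially parallel.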

The five properties of an RDD

  • It is made up of a list of partitions
  • A function is computed on each partition
  • It has dependencies on other RDDs
  • For key-value RDDs, a Partitioner specifies how the data is partitioned
  • The scheduler sends the job to the nodes where the data already lives: moving the data is worse than moving the computation
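The last property (data locality) can be sketched in plain Python. The hostnames and the `block_locations` map are hypothetical, for illustration only:

```python
# Sketch of data locality: schedule each task on a node that already hosts
# a replica of the partition's data (hypothetical hostnames).
block_locations = {
    "part-0": ["node-a", "node-b"],   # nodes holding replicas of partition 0
    "part-1": ["node-b", "node-c"],
}

def preferred_node(partition, free_nodes):
    """Prefer a free node holding the data; fall back to any free node."""
    for node in block_locations.get(partition, []):
        if node in free_nodes:
            return node   # move the computation to the data
    return next(iter(free_nodes))   # no local node free: fall back
```

If `node-a` is busy but `node-b` is free, the task for `part-0` runs on `node-b`, which still holds a replica, so no data crosses the network.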

How the five properties appear in RDD.scala

protected def getPartitions: Array[Partition] // property 1
def compute(split: Partition, context: TaskContext): Iterator[T] // property 2
protected def getDependencies: Seq[Dependency[_]] = deps // property 3
val partitioner: Option[Partitioner] = None // property 4
protected def getPreferredLocations(split: Partition): Seq[String] = Nil // property 5
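A toy Python analogue makes the mapping concrete. This is a conceptual sketch of the five members, not Spark's implementation; the class and method names below are invented to mirror the Scala signatures:

```python
# Toy model of the five RDD properties (illustrative, not Spark code).
class ToyRDD:
    def __init__(self, partitions, deps=None, partitioner=None):
        self._partitions = partitions    # property 1: list of partitions
        self._deps = deps or []          # property 3: dependencies on parent RDDs
        self.partitioner = partitioner   # property 4: optional, for key-value RDDs

    def get_partitions(self):
        return self._partitions

    def get_dependencies(self):
        return self._deps

    def compute(self, split):            # property 2: a function run per partition
        return iter(split)

    def get_preferred_locations(self, split):  # property 5: locality hints
        return []

rdd = ToyRDD(partitions=[[1, 2], [3, 4]])
flat = [x for p in rdd.get_partitions() for x in rdd.compute(p)]  # [1, 2, 3, 4]
```

Note that `compute` returns an iterator over one partition, just as the Scala signature returns `Iterator[T]` for a single `Partition`.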

RDD illustrated

[Spark] RDD core

One task per partition.
First order of business:
Create a SparkContext, which connects to a Spark cluster: local, standalone, YARN, or Mesos.
Before creating the SparkContext, set up a SparkConf.
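The standard configuration boilerplate for this (PySpark shown; the master URL and app name below are placeholder values to adapt to your cluster):

```python
from pyspark import SparkConf, SparkContext

# Build the SparkConf first, then hand it to the SparkContext.
# "local[2]" runs locally with 2 threads; a cluster would use e.g.
# "spark://host:7077" (standalone), "yarn", or "mesos://host:5050".
conf = SparkConf().setMaster("local[2]").setAppName("rdd-demo")
sc = SparkContext(conf=conf)
```

The order matters: most SparkConf settings are read once at context startup and cannot be changed afterwards.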

Ways to create an RDD
  • from a collection
  • from external data
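The two creation paths can be sketched in plain Python. This is a conceptual sketch, not the Spark API; the comments name the real PySpark counterparts (`sc.parallelize`, `sc.textFile`), and the helper names are invented:

```python
# Conceptual sketch of the two ways to create an RDD (not Spark API).
import os
import tempfile

def from_collection(data, num_slices):
    """Like sc.parallelize: split an in-memory collection into num_slices partitions."""
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

def from_text_file(path):
    """Like sc.textFile: each element is one line of the external file."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

parts = from_collection(list(range(6)), num_slices=3)   # [[0, 1], [2, 3], [4, 5]]

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("a\nb\n")
lines = from_text_file(f.name)                          # ["a", "b"]
os.unlink(f.name)
```

Creating from a collection is mostly useful for testing; real jobs typically start from external data (a local file, HDFS, S3, etc.).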

Spark deployment modes
