1. RDD
A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable,
partitioned collection of elements that can be operated on in parallel. The RDD class contains the
basic operations available on all RDDs, such as map, filter, and persist. In addition,
[[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
pairs, such as groupByKey and join; [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations
available only on RDDs of Doubles; and [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains
operations available on RDDs that can be saved as SequenceFiles. All operations are automatically
available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions.
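As a quick sketch of those implicit conversions (assuming a running SparkContext named sc, e.g. inside spark-shell):

```scala
// Assumes a live SparkContext `sc`, e.g. inside spark-shell.
val nums = sc.parallelize(Seq(1, 2, 3, 4))

// Basic operations defined on every RDD:
val doubled = nums.map(_ * 2).filter(_ > 4)          // RDD[Int]

// Because `pairs` is an RDD of key-value pairs, PairRDDFunctions
// such as groupByKey become available via an implicit conversion:
val pairs  = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c")))
val groups = pairs.groupByKey()                      // RDD[(Int, Iterable[String])]
```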
RDD: immutable, partitioned, computed in parallel.
From single-machine storage ===> distributed storage/computation:
1) Data storage: the data is split up (HDFS blocks)
2) Data computation: the work is split up (MapReduce / Spark)
3) Storage plus computation: HDFS/S3 + MapReduce/Spark
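The storage/compute split shows up directly in Spark's API: reading a file from HDFS yields, by default, roughly one partition per HDFS block. A minimal sketch (the path and the SparkContext `sc` are assumptions):

```scala
// Sketch: the HDFS path is hypothetical; requires a running SparkContext `sc`.
// HDFS splits the file into blocks for storage; Spark turns each
// block into a partition for computation.
val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")
println(lines.getNumPartitions)  // roughly one partition per HDFS block
```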
The five characteristics of an RDD:
- It is made up of partitions
- A function is applied to each partition
- It depends on other RDDs
- A key-value RDD can specify how its data is partitioned
- Tasks are scheduled to the nodes where the data lives: moving computation is cheaper than moving data
How the characteristics are implemented in the RDD API:
def compute(split: Partition, context: TaskContext): Iterator[T] (characteristic 2)
protected def getPartitions: Array[Partition] (characteristic 1)
protected def getDependencies: Seq[Dependency[_]] = deps (characteristic 3)
protected def getPreferredLocations(split: Partition): Seq[String] = Nil (characteristic 5)
val partitioner: Option[Partitioner] = None (characteristic 4)
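To see how these members fit together, a toy RDD subclass might override them like this. A sketch only; the class and its logic are illustrative, not real Spark code:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A toy RDD producing the Ints 0 until n, spread over `numSlices` partitions.
class RangeishRDD(sc: SparkContext, n: Int, numSlices: Int)
  extends RDD[Int](sc, Nil) {  // Nil: no parent RDDs (characteristic 3)

  // Characteristic 1: the list of partitions.
  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map { i =>
      new Partition { override def index: Int = i }
    }.toArray

  // Characteristic 2: a function computing each partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    (split.index until n by numSlices).iterator

  // Characteristic 5: no preferred locations for purely in-memory data.
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil

  // Characteristic 4 (partitioner) keeps its default of None,
  // since this is not a key-value RDD.
}
```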
RDD illustrated
One task per partition
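The one-task-per-partition mapping can be checked from the partition count (a sketch, assuming a SparkContext `sc`):

```scala
// Assumes a live SparkContext `sc`.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
// A stage over this RDD runs 4 tasks, one per partition:
println(rdd.getNumPartitions)  // 4
```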
First priority
Create a SparkContext, which connects to a Spark cluster: local, standalone, YARN, or Mesos.
Set up a SparkConf before creating the SparkContext.
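A minimal sketch of that setup (the app name and the local master are illustrative choices):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Set up SparkConf first, then create the SparkContext from it.
// "local[2]" runs Spark locally with 2 threads; a cluster deployment
// would use a standalone, YARN, or Mesos master URL instead.
val conf = new SparkConf()
  .setAppName("RddNotes")   // app name is illustrative
  .setMaster("local[2]")
val sc = new SparkContext(conf)
// ... use sc ...
sc.stop()
```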
Ways to create an RDD
- from a collection
- from external data
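Both creation paths in one sketch (assuming a SparkContext `sc`; the file path is hypothetical):

```scala
// 1) From a driver-side collection:
val fromCollection = sc.parallelize(Seq(1, 2, 3))

// 2) From external data, e.g. a text file (hypothetical path):
val fromFile = sc.textFile("/tmp/input.txt")
```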
Spark run modes