1. RDD
A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable,
partitioned collection of elements that can be operated on in parallel. The RDD class contains the
basic operations available on all RDDs, such as map, filter, and persist. In addition,
[[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
pairs, such as groupByKey and join; [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations
available only on RDDs of Doubles; and [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains
operations available on RDDs that can be saved as SequenceFiles. All operations are automatically
available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions.
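As a quick sketch of those implicit conversions (assuming a running SparkContext named sc, e.g. inside spark-shell):

```scala
// Assumes a live SparkContext `sc`, e.g. inside spark-shell.
val nums = sc.parallelize(Seq(1, 2, 3, 4))

// Basic operations defined on every RDD:
val doubled = nums.map(_ * 2).filter(_ > 4)          // RDD[Int]

// Because `pairs` is an RDD of key-value pairs, PairRDDFunctions
// such as groupByKey become available via an implicit conversion:
val pairs  = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c")))
val groups = pairs.groupByKey()                      // RDD[(Int, Iterable[String])]
```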
RDD: immutable, partitioned, computed in parallel.
From single-machine storage ===> distributed storage/computation:
1) Data storage: the data is split up (HDFS blocks)
2) Data computation: the work is split up (MapReduce / Spark)
3) Storage plus computation: HDFS/S3 + MapReduce/Spark
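The storage/compute split shows up directly in Spark's API: reading a file from HDFS yields, by default, roughly one partition per HDFS block. A minimal sketch (the path and the SparkContext `sc` are assumptions):

```scala
// Sketch: the HDFS path is hypothetical; requires a running SparkContext `sc`.
// HDFS splits the file into blocks for storage; Spark turns each
// block into a partition for computation.
val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")
println(lines.getNumPartitions)  // roughly one partition per HDFS block
```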
The five characteristics of an RDD:
- It is made up of partitions
- A function is applied to each partition
- It depends on other RDDs
- A key-value RDD can specify how its data is partitioned
- Tasks are scheduled to the nodes where the data lives: moving computation is cheaper than moving data
How the characteristics are implemented in the RDD API:
def compute(split: Partition, context: TaskContext): Iterator[T] (characteristic 2)
protected def getPartitions: Array[Partition] (characteristic 1)
protected def getDependencies: Seq[Dependency[_]] = deps (characteristic 3)
protected def getPreferredLocations(split: Partition): Seq[String] = Nil (characteristic 5)
val partitioner: Option[Partitioner] = None (characteristic 4)
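To see how these members fit together, a toy RDD subclass might override them like this. A sketch only; the class and its logic are illustrative, not real Spark code:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A toy RDD producing the Ints 0 until n, spread over `numSlices` partitions.
class RangeishRDD(sc: SparkContext, n: Int, numSlices: Int)
  extends RDD[Int](sc, Nil) {  // Nil: no parent RDDs (characteristic 3)

  // Characteristic 1: the list of partitions.
  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map { i =>
      new Partition { override def index: Int = i }
    }.toArray

  // Characteristic 2: a function computing each partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    (split.index until n by numSlices).iterator

  // Characteristic 5: no preferred locations for purely in-memory data.
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil

  // Characteristic 4 (partitioner) keeps its default of None,
  // since this is not a key-value RDD.
}
```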
RDD illustrated
One task per partition
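The one-task-per-partition mapping can be checked from the partition count (a sketch, assuming a SparkContext `sc`):

```scala
// Assumes a live SparkContext `sc`.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
// A stage over this RDD runs 4 tasks, one per partition:
println(rdd.getNumPartitions)  // 4
```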
First priority
Create a SparkContext, which connects to a Spark cluster: local, standalone, YARN, or Mesos.
Set up a SparkConf before creating the SparkContext.
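A minimal sketch of that setup (the app name and the local master are illustrative choices):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Set up SparkConf first, then create the SparkContext from it.
// "local[2]" runs Spark locally with 2 threads; a cluster deployment
// would use a standalone, YARN, or Mesos master URL instead.
val conf = new SparkConf()
  .setAppName("RddNotes")   // app name is illustrative
  .setMaster("local[2]")
val sc = new SparkContext(conf)
// ... use sc ...
sc.stop()
```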
Ways to create an RDD
- from a collection
- from external data
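Both creation paths in one sketch (assuming a SparkContext `sc`; the file path is hypothetical):

```scala
// 1) From a driver-side collection:
val fromCollection = sc.parallelize(Seq(1, 2, 3))

// 2) From external data, e.g. a text file (hypothetical path):
val fromFile = sc.textFile("/tmp/input.txt")
```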
Spark run modes