Learning Spark: RDD - 爱码网

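The explanation is good, though a bit disorganized; it also collects some reference material.

The book 《图解Spark：核心技术与案例实战》 is also quite good, and I recommend it to beginners.

In short, an RDD is a resilient distributed dataset. I recommend memorizing the English definitions: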

•RDD, short for resilient distributed dataset, is Spark's core abstraction.

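Here, "resilient" means: able to withstand or recover quickly from difficult conditions.

That is, these collections are resilient: if part of a dataset is lost, it can be rebuilt from its "lineage", which gives a high degree of fault tolerance. Because an RDD exposes an interface based on coarse-grained transformations, which apply the same operation to many data items, it can record the lineage that created a dataset rather than storing the actual data, and so achieves fault tolerance efficiently. Seen this way, doesn't it feel a little like a blockchain?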
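The lineage idea can be sketched in plain Python. This is only a toy model under stated assumptions, not how Spark is implemented: `ToyDataset`, its `cache` field, and the `compute` method are hypothetical names made up for illustration. Each dataset remembers its parent and the function that produced it, so a lost partition can be rebuilt by replaying that recipe.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ToyDataset:
    """Hypothetical stand-in for an RDD: a recipe plus cached partitions."""
    num_partitions: int
    parent: Optional["ToyDataset"] = None        # lineage: where the data came from
    fn: Optional[Callable[[int], int]] = None    # coarse-grained op applied to every element
    source: Optional[List[List[int]]] = None     # only the root holds real input data
    cache: Dict[int, List[int]] = field(default_factory=dict)

    def map(self, fn):
        # Record the transformation instead of copying any data.
        return ToyDataset(self.num_partitions, parent=self, fn=fn)

    def compute(self, i):
        """Return partition i, rebuilding it from lineage if it is not cached."""
        if i in self.cache:
            return self.cache[i]
        if self.parent is None:
            part = list(self.source[i])
        else:
            part = [self.fn(x) for x in self.parent.compute(i)]
        self.cache[i] = part
        return part


root = ToyDataset(2, source=[[1, 2], [3, 4]])
doubled = root.map(lambda x: x * 2)
plus_one = doubled.map(lambda x: x + 1)

before = plus_one.compute(1)   # materializes partition 1 along the chain
plus_one.cache.pop(1)          # simulate losing that partition
doubled.cache.pop(1)           # ...and its cached parent partition
after = plus_one.compute(1)    # rebuilt purely from the recorded lineage
print(before, after)
```

Only the root keeps real data here. Every derived dataset stores just a function and a pointer to its parent, which is why recording lineage is much cheaper than replicating the data itself.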

•It is an immutable distributed collection of objects that can be recomputed from its history.


•Internally, Spark distributes the data in an RDD to different nodes across the cluster to achieve parallelism.
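A rough single-machine analogy for this partitioning, assuming worker threads stand in for cluster nodes: the helper names `split_into_partitions` and `square_partition` are invented for this sketch and are not part of Spark.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions roughly equal contiguous chunks."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def square_partition(partition):
    # The same operation runs independently on each partition.
    return [x * x for x in partition]


data = list(range(10))
partitions = split_into_partitions(data, 3)

# Each worker thread plays the role of a node processing its own partition.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(square_partition, partitions))

# Flatten the per-partition results back into one list, in partition order.
squares = [x for part in results for x in part]
print(partitions)
print(squares)
```

Because each partition is processed without looking at the others, the work can be spread over as many workers as there are partitions.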

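RDDs support two kinds of operations: transformations and actions. A transformation creates a new dataset from an existing one, while an action runs a computation on the dataset and returns a value to the driver program.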
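The split between the two kinds of operations can be illustrated with a small pure-Python sketch. `LazyDataset` is a hypothetical class written for this example. Its method names mirror Spark's, where `map` and `filter` are transformations and `collect`, `count`, and `reduce` are actions, but the internals here are an assumption and far simpler than Spark's.

```python
from functools import reduce as _reduce


class LazyDataset:
    """Toy dataset: transformations build a plan, actions execute it."""

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = tuple(ops)   # recorded transformations, not yet run

    # Transformations: return a new dataset and compute nothing.
    def map(self, fn):
        return LazyDataset(self._data, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self._ops + (("filter", pred),))

    # Actions: run the plan and return a plain value to the caller.
    def collect(self):
        items = iter(self._data)
        for kind, fn in self._ops:
            # These are Python's built-in map/filter, not the methods above.
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def count(self):
        return len(self.collect())

    def reduce(self, fn):
        return _reduce(fn, self.collect())


calls = []


def traced_square(x):
    calls.append(x)   # record every invocation so laziness is observable
    return x * x


numbers = LazyDataset([1, 2, 3, 4, 5])
even_squares = numbers.map(traced_square).filter(lambda v: v % 2 == 0)

calls_before_action = len(calls)   # still 0: no transformation has run yet
result = even_squares.collect()    # the action triggers the work
total = even_squares.reduce(lambda a, b: a + b)
print(calls_before_action, result, total)
```

Chaining `map` and `filter` costs nothing until `collect` or `reduce` is called. Each action replays the recorded plan from the original data, which is the same replay that makes lineage-based recovery possible.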

RDD dependencies:

[Figure: RDD dependencies]