Spark Streaming Core Concepts

Core Concepts

Core Concept: StreamingContext

In IDEA, search for StreamingContext.scala and look at its constructors:

def this(sparkContext: SparkContext, batchDuration: Duration) = {
  this(sparkContext, null, batchDuration)
}

def this(conf: SparkConf, batchDuration: Duration) = {
  this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}

The batch interval can be set according to the latency requirements of your application and the resources available in the cluster.
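A minimal sketch of using these constructors (the app name and the 5-second batch interval are illustrative choices, not from the notes):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ContextSetup {
  def main(args: Array[String]): Unit = {
    // local[2]: at least one thread for a receiver, one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("ContextSetup")
    // Seconds(5) is the batch interval: input data is grouped into a new RDD every 5 seconds
    val ssc = new StreamingContext(conf, Seconds(5))
  }
}
```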

After a context is defined, you have to do the following.

  1. Define the input sources by creating input DStreams.
  2. Define the streaming computations by applying transformation and
    output operations to DStreams.
  3. Start receiving data and processing it using
    streamingContext.start().
  4. Wait for the processing to be stopped (manually or due to any error)
    using streamingContext.awaitTermination().
  5. The processing can be manually stopped using
    streamingContext.stop().

Once the StreamingContext is defined, these steps can be carried out.
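The five steps above map onto code roughly as follows (a sketch; the host `localhost`, port 9999, app name and batch interval are placeholder choices):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Lifecycle {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[2]").setAppName("Lifecycle"), Seconds(5))

    // 1. Define the input source by creating an input DStream
    val lines = ssc.socketTextStream("localhost", 9999)
    // 2. Define the streaming computation plus an output operation
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    // 3. Start receiving data and processing it
    ssc.start()
    // 4. Wait for the processing to be stopped (manually or due to an error)
    ssc.awaitTermination()
    // 5. ssc.stop() would stop the processing manually (e.g. from another thread)
  }
}
```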

Discretized Streams (DStreams)

Internally, a DStream is represented by a continuous series of RDDs. Each RDD in a DStream contains data from a certain interval.

Operators applied to a DStream, such as map/flatMap, are translated under the hood into the same operation on each of its RDDs, because a DStream is made up of the RDDs of its successive batches.
(From the series "Spark Streaming Real-Time Stream Processing in Action", Notes 7)

Input DStreams and Receivers

Input DStreams are DStreams representing the stream of input data received from streaming sources.

Every input DStream (except file stream, discussed later in this section) is associated with a Receiver object which receives the data from a source and stores it in Spark's memory for processing.
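The receiver association is visible in the API: a receiver-based source such as the socket stream accepts a StorageLevel for the received data, while the file stream does not, because it needs no receiver (a sketch; host, port and HDFS path are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Sources {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[2]").setAppName("Sources"), Seconds(5))

    // Receiver-based: a Receiver runs on an executor and stores incoming
    // data at the given storage level before it is processed
    val socketStream =
      ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)

    // Receiver-less: the driver monitors the directory for new files,
    // so no long-running receiver thread is occupied
    val fileStream = ssc.textFileStream("hdfs://namenode:8020/streaming/input")
  }
}
```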

Transformations
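A few common DStream transformations, sketched on a hypothetical socket stream (host, port and the length-3 filter are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Transformations {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[2]").setAppName("Transformations"), Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)

    val words    = lines.flatMap(_.split(" "))  // one record per word
    val longOnes = words.filter(_.length > 3)   // drop short words
    val pairs    = words.map((_, 1))            // key each word with a count of 1
    val counts   = pairs.reduceByKey(_ + _)     // per-batch word count

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```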

Output Operations
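Like RDD actions, output operations are what actually trigger the execution of all the DStream transformations; without one, nothing runs. A sketch of the common ones (the output path prefix is a placeholder):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Outputs {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[2]").setAppName("Outputs"), Seconds(5))
    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    counts.print()                           // prints the first 10 elements of each batch
    counts.saveAsTextFiles("out/wordcounts") // writes one directory per batch interval
    counts.foreachRDD { rdd =>               // arbitrary per-batch logic, e.g. writing to a DB
      println(s"records in this batch: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```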

Hands-on Examples

Example 1: Processing socket data with Spark Streaming

Error encountered: java.lang.NoClassDefFoundError: net/jpountz/util/SafeUtils

Search the Maven repository (found via Baidu) for the missing artifact.
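A common fix for this error (an assumption based on the missing `net.jpountz` package) is to add the lz4 compression library to the project's pom.xml; the version shown is illustrative:

```xml
<dependency>
    <groupId>net.jpountz.lz4</groupId>
    <artifactId>lz4</artifactId>
    <version>1.3.0</version>
</dependency>
```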

Example 2: Processing HDFS file data with Spark Streaming


Notes on reading from file systems (from the official docs):

All files must be in the same data format.
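A word-count sketch over an HDFS directory (the path is a placeholder). Because the file stream uses no receiver, a single local thread is enough; the comments reflect the official docs' requirements on the monitored directory:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HdfsWordCount {
  def main(args: Array[String]): Unit = {
    // "local" (one thread) suffices: no receiver thread is needed for file streams
    val ssc = new StreamingContext(
      new SparkConf().setMaster("local").setAppName("HdfsWordCount"), Seconds(5))

    // The driver monitors the directory for new files. All files must share
    // one data format, should be moved into the directory atomically
    // (e.g. hdfs dfs -mv), and must not be changed once moved.
    val lines = ssc.textFileStream("hdfs://namenode:8020/streaming/input")
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```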
