spark streaming 5: InputDStream

InputDStream的继承关系。他们都是使用InputDStream这个抽象类的接口进行操作的。特别注意ReceiverInputDStream这个类，大部分时候我们使用的是它作为扩展的基类，因为它才能（更容易）使接收数据的工作分散到各个worker上执行，更符合分布式计算的理念。

所有的输入流都某个时间间隔将数据以block的形式保存到spark memory中，但以spark core不同的是，spark streaming默认是将对象序列化后保存到内存中。

/**
 * This is the abstract base class . This class provides methods
 * start() and stop() which is called by Spark Streaming system to .
 * Input streams that can  For example,
 * FileInputDStream, a subclass of InputDStream, monitors a HDFS directory from the driver for
 * new files and generates RDDs with the new files. .
 *
 * @param ssc_ Streaming context that will execute this input stream
 */
abstract class T@transient extends Tprivatevar lastValidTimenull

  graphthis

/**
 * Abstract class for defining any [[org.apache.spark.streaming.dstream.InputDStream]]
 * that has to start a receiver on worker nodes to receive external data.
 *  * @param ssc_ Streaming context that will execute this input stream
 * @tparam T Class type of the object of this stream
 */
abstract class T@transient extends T/** Keeps all received blocks information */
  private lazy val new , /** This is an unique identifier for the network input stream. */
  val id

/**
 * Gets the receiver object that will be sent to the worker nodes
 * to receive data. This method needs to defined by any specific implementation
 * of a NetworkInputDStream.
 */
def getReceiver(): Receiver[T]

最终都是以BlockRDD返回的

/** Ask ReceiverInputTracker for received data blocks and generates RDDs with them. */
override def compute(validTime: Time): Option[RDD[T]] = {
// If this is called for any time before the start time of the context,
  // then this returns an empty RDD. This may happen when recovering from a
  // master failure
  if (validTime >= graph.startTime) {
val blockInfo = ssc.scheduler.receiverTracker.getReceivedBlockInfo(id)
receivedBlockInfo(validTime) = blockInfo
val blockIds = blockInfo.map(_.blockId.asInstanceOf[BlockId])
Some(new BlockRDD[T](ssc.sc, blockIds))
  } else {
Some(new BlockRDD[T](ssc.sc, Array[BlockId]()))
  }
}

From WizNote

posted on 2015-02-05 17:17 过雁阅读(...) 评论(...) 编辑收藏