【问题标题】:Output contents of DStream in Scala Apache SparkScala Apache Spark 中 DStream 的输出内容
【发布时间】:2015-05-14 14:45:45
【问题描述】:

下面的 Spark 代码似乎没有对文件 example.txt 执行任何操作

val conf = new org.apache.spark.SparkConf()
  .setMaster("local")
  .setAppName("filter")
  .setSparkHome("C:\\spark\\spark-1.2.1-bin-hadoop2.4")
  .set("spark.executor.memory", "2g");

val ssc = new StreamingContext(conf, Seconds(1))
val dataFile: DStream[String] = ssc.textFileStream("C:\\example.txt")

dataFile.print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate

我正在尝试使用 dataFile.print() 打印文件的前 10 个元素

一些生成的输出:

15/03/12 12:23:53 INFO JobScheduler: Started JobScheduler
15/03/12 12:23:54 INFO FileInputDStream: Finding new files took 105 ms
15/03/12 12:23:54 INFO FileInputDStream: New files at time 1426163034000 ms:

15/03/12 12:23:54 INFO JobScheduler: Added jobs for time 1426163034000 ms
15/03/12 12:23:54 INFO JobScheduler: Starting job streaming job 1426163034000 ms.0 from job set of time 1426163034000 ms
-------------------------------------------
Time: 1426163034000 ms
-------------------------------------------

15/03/12 12:23:54 INFO JobScheduler: Finished job streaming job 1426163034000 ms.0 from job set of time 1426163034000 ms
15/03/12 12:23:54 INFO JobScheduler: Total delay: 0.157 s for time 1426163034000 ms (execution: 0.006 s)
15/03/12 12:23:54 INFO FileInputDStream: Cleared 0 old files that were older than 1426162974000 ms: 
15/03/12 12:23:54 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer()
15/03/12 12:23:55 INFO FileInputDStream: Finding new files took 2 ms
15/03/12 12:23:55 INFO FileInputDStream: New files at time 1426163035000 ms:

15/03/12 12:23:55 INFO JobScheduler: Added jobs for time 1426163035000 ms
15/03/12 12:23:55 INFO JobScheduler: Starting job streaming job 1426163035000 ms.0 from job set of time 1426163035000 ms
-------------------------------------------
Time: 1426163035000 ms
-------------------------------------------

15/03/12 12:23:55 INFO JobScheduler: Finished job streaming job 1426163035000 ms.0 from job set of time 1426163035000 ms
15/03/12 12:23:55 INFO JobScheduler: Total delay: 0.011 s for time 1426163035000 ms (execution: 0.001 s)
15/03/12 12:23:55 INFO MappedRDD: Removing RDD 1 from persistence list
15/03/12 12:23:55 INFO BlockManager: Removing RDD 1
15/03/12 12:23:55 INFO FileInputDStream: Cleared 0 old files that were older than 1426162975000 ms: 
15/03/12 12:23:55 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer()
15/03/12 12:23:56 INFO FileInputDStream: Finding new files took 3 ms
15/03/12 12:23:56 INFO FileInputDStream: New files at time 1426163036000 ms:

example.txt 的格式为:

gdaeicjdcg,194,155,98,107
jhbcfbdigg,73,20,122,172
ahdjfgccgd,28,47,40,178
afeidjjcef,105,164,37,53
afeiccfdeg,29,197,128,85
aegddbbcii,58,126,89,28
fjfdbfaeid,80,89,180,82

正如print 文档所述:

/** * 打印此 DStream 中生成的每个 RDD 的前十个元素。这是一个输出 * 操作符,所以这个 DStream 将被注册为输出流并在那里实现。 */

这是否意味着已经为这个流生成了 0 个 RDD?如果想查看 RDD 的内容,使用 Apache Spark 将使用 RDD 的收集功能。这些是 Streams 的类似方法吗?那么简而言之,如何打印到 Stream 的控制台内容?

更新:

根据@0x0FFF 注释更新了代码。 http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html 似乎没有给出从本地文件系统读取的示例。这不像使用 Spark 核心那样常见吗,那里有从文件中读取数据的明确示例?

这里是更新的代码:

val conf = new org.apache.spark.SparkConf()
  .setMaster("local[2]")
  .setAppName("filter")
  .setSparkHome("C:\\spark\\spark-1.2.1-bin-hadoop2.4")
  .set("spark.executor.memory", "2g");

val ssc = new StreamingContext(conf, Seconds(1))
val dataFile: DStream[String] = ssc.textFileStream("file:///c:/data/")

dataFile.print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate

但是输出是一样的。当我将新文件添加到c:\\data dir(与现有数据文件具有相同格式)时,它们不会被处理。我假设dataFile.print 应该将前 10 行打印到控制台?

更新 2:

也许这与我在 Windows 环境中运行此代码有关?

【问题讨论】:

    标签: scala apache-spark


    【解决方案1】:

    您误解了textFileStream 的用法。以下是 Spark 文档中的描述:

    创建一个输入流,用于监控与 Hadoop 兼容的文件系统中的新文件并将它们作为文本文件读取(使用 LongWritable 的键、Text 的值和 TextInputFormat 的输入格式)。

    首先,您应该将目录传递给它,其次,该目录应该可以从运行接收器的节点获得,因此最好使用 HDFS 来实现此目的。然后,当您将 new 文件放入此目录时,它将由函数 print() 处理并为其打印前 10 行

    更新:

    我的代码:

    [alex@sparkdemo tmp]$ pyspark --master local[2]
    Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) 
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    Spark assembly has been built with Hive, including Datanucleus jars on classpath
    s15/03/12 06:37:49 WARN Utils: Your hostname, sparkdemo resolves to a loopback address: 127.0.0.1; using 192.168.208.133 instead (on interface eth0)
    15/03/12 06:37:49 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
    
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 1.2.0
          /_/
    
    Using Python version 2.6.6 (r266:84292, Nov 22 2013 12:16:22)
    SparkContext available as sc.
    >>> from pyspark.streaming import StreamingContext
    >>> ssc = StreamingContext(sc, 30)
    >>> dataFile = ssc.textFileStream('file:///tmp')
    >>> dataFile.pprint()
    >>> ssc.start()
    >>> ssc.awaitTermination()
    -------------------------------------------
    Time: 2015-03-12 06:40:30
    -------------------------------------------
    
    -------------------------------------------
    Time: 2015-03-12 06:41:00
    -------------------------------------------
    
    -------------------------------------------
    Time: 2015-03-12 06:41:30
    -------------------------------------------
    1 2 3
    4 5 6
    7 8 9
    
    -------------------------------------------
    Time: 2015-03-12 06:42:00
    -------------------------------------------
    

    【讨论】:

    • 所以无法使用从本地文件系统读取的数据测试 Spark Streams?
    • 如果您在本地运行流式传输并允许它使用至少 2 个核心,即本地 [2],并且还指定您使用的是目录名称中带有 file:// 的本地文件系统,则可以这样做
    • 我在 CentOS 上对 pyspark 和本地文件做了完全一样的事情,而且效果很好。问题可能与Windows有关
    • 感谢您发布有用的答案,但我将保留此问题,因为 Windows 不支持 Spark Streams,或者我没有正确读取 Windows 的文件。我没有找到明确说明这一点的文档,但也没有找到详细说明 Spark Streams Windows 示例的文档。
    • 很好,也许有人知道 Windows 的细节。我认为您应该尝试 spark 邮件列表 user@spark.apache.org
    【解决方案2】:

    这是我编写的一个自定义接收器,它在指定目录监听数据:

    package receivers
    
    import java.io.File
    import org.apache.spark.{ SparkConf, Logging }
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{ Seconds, StreamingContext }
    import org.apache.spark.streaming.receiver.Receiver
    
    class CustomReceiver(dir: String)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {
    
      def onStart() {
        // Start the thread that receives data over a connection
        new Thread("File Receiver") {
          override def run() { receive() }
        }.start()
      }
    
      def onStop() {
        // There is nothing much to do as the thread calling receive()
        // is designed to stop by itself isStopped() returns false
      }
    
      def recursiveListFiles(f: File): Array[File] = {
        val these = f.listFiles
        these ++ these.filter(_.isDirectory).flatMap(recursiveListFiles)
      }
    
      private def receive() {
    
        for (f <- recursiveListFiles(new File(dir))) {
    
          val source = scala.io.Source.fromFile(f)
          val lines = source.getLines
          store(lines)
          source.close()
          logInfo("Stopped receiving")
          restart("Trying to connect again")
    
        }
      }
    }
    

    我认为需要注意的一件事是文件需要在 batchDuration 的时间内处理。在下面的示例中,它设置为 10 秒,但如果接收方处理文件的时间超过 10 秒,则某些数据文件将不会被处理。我愿意在这一点上进行更正。

    这是自定义接收器的实现方式:

    val conf = new org.apache.spark.SparkConf()
      .setMaster("local[2]")
      .setAppName("filter")
      .setSparkHome("C:\\spark\\spark-1.2.1-bin-hadoop2.4")
      .set("spark.executor.memory", "2g");
    
    val ssc = new StreamingContext(conf, Seconds(10))
    
    val customReceiverStream: ReceiverInputDStream[String] = ssc.receiverStream(new CustomReceiver("C:\\data\\"))
    
    customReceiverStream.print
    
    customReceiverStream.foreachRDD(m => {
      println("size is " + m.collect.size)
    })
    
    ssc.start() // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
    

    更多信息: http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html & https://spark.apache.org/docs/1.2.0/streaming-custom-receivers.html

    【讨论】:

      【解决方案3】:

      我可能发现了您的问题,您的日志中应该有这个:

      WARN StreamingContext: spark.master should be set as local[n], n > 1 in local mode if you have receivers to get data, otherwise Spark jobs will not get resources to process the received data.
      

      问题是您需要至少有 2 个内核才能运行 Spark 流应用程序。 所以解决方案应该是简单地替换:

      val conf = new org.apache.spark.SparkConf()
       .setMaster("local")
      

      作者:

      val conf = new org.apache.spark.SparkConf()
        .setMaster("local[*]")
      

      或者至少不止一个。

      【讨论】:

        猜你喜欢
        • 2015-11-05
        • 1970-01-01
        • 1970-01-01
        • 2019-02-07
        • 1970-01-01
        • 2018-07-06
        • 2016-03-24
        • 2014-12-21
        • 2020-08-23
        相关资源
        最近更新 更多