Spark Streaming fileStream: Answers

【Question title】: spark streaming fileStream
【Posted】: 2013-05-15 09:00:29
【Question】:

I am programming with Spark Streaming and have run into a problem in Scala. I am trying to use the function StreamingContext.fileStream.

The function is defined like this:

def fileStream[K, V, F <: InputFormat[K, V]](directory: String)(implicit arg0: ClassManifest[K], arg1: ClassManifest[V], arg2: ClassManifest[F]): DStream[(K, V)]

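Create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them using the given key-value types and input format. File names starting with . are ignored.
K: key type for reading the HDFS file
V: value type for reading the HDFS file
F: input format for reading the HDFS file
directory: HDFS directory to monitor for new files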

I don't know how to pass the Key and Value types. My Spark Streaming code:

val ssc = new StreamingContext(args(0), "StreamingReceiver", Seconds(1),
  System.getenv("SPARK_HOME"), Seq("/home/mesos/StreamingReceiver.jar"))

// Create a NetworkInputDStream on target ip:port and count the
val lines = ssc.fileStream("/home/sequenceFile")

The Java code that writes the Hadoop file:

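import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MyDriver {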

private static final String[] DATA = { "One, two, buckle my shoe",
        "Three, four, shut the door", "Five, six, pick up sticks",
        "Seven, eight, lay them straight", "Nine, ten, a big fat hen" };

public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);
    IntWritable key = new IntWritable();
    Text value = new Text();
    SequenceFile.Writer writer = null;
    try {
        writer = SequenceFile.createWriter(fs, conf, path, key.getClass(),
                value.getClass());
        for (int i = 0; i < 100; i++) {
            key.set(100 - i);
            value.set(DATA[i % DATA.length]);
            System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key,
                    value);
            writer.append(key, value);
        }
    } finally {
        IOUtils.closeStream(writer);
    }
}

}

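【Comments】:

What problem did you run into? Are you getting compile errors? If so, what are they? Do you see errors or unexpected behavior when you run the code? You are more likely to get a useful answer if you give more background on the errors or unexpected behavior you are seeing.

Tags: scala streaming apache-spark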

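【Solution 1】:

If you want to use fileStream, you have to supply all 3 type parameters when you call it. Before calling it, you need to know what your Key, Value, and InputFormat types are. If your types are LongWritable, Text, and TextInputFormat, you can call fileStream like this: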

val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/sequenceFile")

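If those 3 types do happen to be yours, you may want to use textFileStream instead: it takes no type parameters and delegates to fileStream with the 3 types mentioned above. Using it looks like this: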

val lines = ssc.textFileStream("/home/sequenceFile")
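The file written by the question's Java code is a SequenceFile with IntWritable keys and Text values, so the type parameters would presumably need to match that rather than the text-file types. A minimal sketch, assuming a later Spark release than the question's (the `org.apache.spark` package names and the `SparkConf`-based constructor) and the new-API `org.apache.hadoop.mapreduce` input format; the object name `SequenceFileStreamSketch` is hypothetical, and the path is reused from the question and is assumed to be a directory that receives the files:

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical driver object; not part of the original question or answers.
object SequenceFileStreamSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: the master URL is supplied externally (e.g. by spark-submit).
    val conf = new SparkConf().setAppName("StreamingReceiver")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Key, value and input format types mirror what the Java writer produced.
    val records = ssc
      .fileStream[IntWritable, Text, SequenceFileInputFormat[IntWritable, Text]]("/home/sequenceFile")
      // Convert the Writable wrappers to plain Scala values right away.
      .map { case (k, v) => (k.get, v.toString) }

    records.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Converting to `Int` and `String` immediately is a cautious choice: Hadoop record readers commonly reuse the same Writable instances, so holding on to the raw objects can give surprising results.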

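【Comments】:

Hey, I'm trying to do the same thing, but for binary files. I followed the instructions here, but unfortunately it doesn't work. Could you suggest something? stackoverflow.com/questions/45778016/…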

【Solution 2】:

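import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Accept a file only if the number after the last "_" in its name is earlier than the current time in millis.
val filterF: Path => Boolean = { x =>
  x.toString.split("/").last.split("_").last.toLong < System.currentTimeMillis
}

// Overload taking a path filter and newFilesOnly = false.
val streamed_rdd = ssc
  .fileStream[LongWritable, Text, TextInputFormat]("/user/hdpprod/temp/spark_streaming_input", filterF, false)
  .map(_._2.toString)   // keep only the line text
  .map(u => u.split('\t')) // split each line into tab-separated fields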
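The filter above throws if a file name in the directory does not end in a numeric suffix. One possible hardening is to treat unparsable names as "not ready". This is only a sketch: the object `TimestampFilter`, the helper `isDue`, and the sample file names are all hypothetical, and the helper works on the plain file name so that it can be tried without Hadoop or Spark on the classpath:

```scala
import scala.util.Try

// Hypothetical helper; not part of the original answer.
object TimestampFilter {
  // True when the text after the last '_' parses as a Long smaller than nowMillis.
  // Names without a numeric suffix are rejected instead of throwing.
  def isDue(fileName: String, nowMillis: Long): Boolean =
    Try(fileName.split("_").last.toLong).toOption.exists(_ < nowMillis)

  def main(args: Array[String]): Unit = {
    val now = 1000L
    println(isDue("part_999", now))  // suffix is in the past
    println(isDue("part_1001", now)) // suffix is in the future
    println(isDue("part_tmp", now))  // suffix is not a number
  }
}
```

To use it with fileStream, it could be wrapped as `(p: Path) => TimestampFilter.isDue(p.getName, System.currentTimeMillis)`.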

【Comments】: