就数据量而言，火花流的限制是什么？答案

【问题标题】：What's the limit to spark streaming in terms of data amount?就数据量而言，火花流的限制是什么？
【发布时间】：2016-06-11 23:56:25
【问题描述】：

我有数千万行数据。是否可以在一周或一天内使用火花流分析所有这些？就数据量而言，火花流的限制是什么？我不确定上限是多少，何时应该将它们放入我的数据库中，因为 Stream 可能无法再处理它们了。我也有不同的时间窗口 1,3, 6 小时等，我使用窗口操作来分离数据。

请在下面找到我的代码：

conf = SparkConf().setAppName(appname)
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc,300)
sqlContext = SQLContext(sc)
channels = sc.cassandraTable("abc","channels")
topic = 'abc.crawled_articles'
kafkaParams = {"metadata.broker.list": "0.0.0.0:9092"}

category = 'abc.crawled_article'
category_stream = KafkaUtils.createDirectStream(ssc, [category], kafkaParams)
category_join_stream = category_stream.map(lambda x:read_json(x[1])).filter(lambda x:x!=0).map(lambda x:categoryTransform(x)).filter(lambda x:x!=0).map(lambda x:(x['id'],x))


article_stream = KafkaUtils.createDirectStream(ssc, [topic], kafkaParams)
article_join_stream=article_stream.map(lambda x:read_json(x[1])).filter(lambda x: x!=0).map(lambda x:TransformInData(x)).filter(lambda x: x!=0).flatMap(lambda x:(a for a in x)).map(lambda x:(x['id'].encode("utf-8") ,x))

#axes topic  integration the article and the axes
axes_topic = 'abc.crawled_axes'
axes_stream = KafkaUtils.createDirectStream(ssc, [axes_topic], kafkaParams)
axes_join_stream = axes_stream.filter(lambda x:'delete' not in str(x)).map(lambda x:read_json(x[1])).filter(lambda x: x!=0).map(lambda x:axesTransformData(x)).filter(lambda x: x!=0).map(lambda x:(str(x['id']),x)).map(lambda x:(x[0],{'id':x[0], 'attitudes':x[1]['likes'],'reposts':0,'comments':x[1]['comments'],'speed':x[1]['comments']}))
#axes_join_stream.reduceByKeyAndWindow(lambda x, y: x + y, 30, 10).transform(axestrans).pprint()

#join
statistics = article_join_stream.window(1*60*60,5*60).cogroup(category_join_stream.window(1*60*60,60)).cogroup((axes_join_stream.window(24*60*60,5*60)))
statistics.transform(joinstream).pprint()

ssc.start()    # Start the computation ssc.awaitTermination()
ssc.awaitTermination()

【问题讨论】：

这里有多个问题，如果你清楚地分开它们将有助于回答。此外，如果您将包含的代码最小化为足以说明问题的最小示例，这将很有帮助

标签： apache-spark spark-streaming datastax-enterprise

【解决方案1】：

一次一个：

是否可以在 [给定时间] 内分析 [一些大量的行]？

通常，是的 - Spark 允许您在多台机器上进行横向扩展，因此原则上您应该能够在相对较短的时间内启动一个大型集群并处理大量数据（假设我们谈论的是几小时或几天，而不是几秒钟或更少，这可能会因开销而产生问题）。

具体来说，在我看来，对数千万条记录执行您的问题中说明的那种处理在合理的时间内是可行的（即不使用非常大的集群）。

Spark Streaming 在数据量方面的限制是多少？

我不知道，但你很难做到。有一些非常大的部署示例，例如在ebay（“平均每天超过 30TB 的数百个指标”）。另外，请参阅FAQ，其中提到了一个由 8000 台机器组成的集群并处理 PB 的数据。

何时应将结果写入 [某种存储]？

根据Spark-Streaming的basic model，数据是微批量处理的。如果你的数据确实是一个流（即没有明确的结束），那么最简单的方法是存储每个 RDD 的处理结果（即 microbatch）。

如果您的数据不是流，例如您不时处理一堆静态文件，您可能应该考虑放弃流部分（例如，仅使用 Spark 作为批处理器）。

由于您的问题提到几个小时的窗口大小，我怀疑您可能需要考虑批处理选项。

如何在不同的时间窗口处理相同的数据？

如果您使用 Spark-Streaming，您可以维护多个状态（例如，使用 mapWithState） - 每个时间窗口一个。

另一个想法（代码更简单，操作更复杂） - 您可以启动多个集群，每个集群都有自己的窗口，从同一个流中读取。

如果您是批处理，您可以使用不同的时间窗口多次运行相同的操作，例如reduceByWindow 具有多种窗口大小。

【讨论】：