Do Big Data Stream Processing in the Stream Way
Reading: Years in Big Data. Months with Apache Flink. 5 Early Observations With Stream Processing: https://data-artisans.com/blog/early-observations-apache-flink.
The article suggests adopting the right solution, Flink, for big data processing. Flink is interesting and built specifically for stream processing.
The broader takeaway may be to solve problems using the right solution. History, and current practice still, offer many painful attempts at the wrong fit: handling huge-scale data in traditional databases, processing unstructured data in relational databases, doing graph processing with tables, simulating stream processing with micro-batches, and so on. A specific problem should be handled by a solution built for that problem; such a solution can be the most efficient and convenient one.
Some good examples and points from the article:
“In reality, however, processing data with as low latency as possible has been a challenge for a long time….a customer asked me how to produce an up-to-date aggregation over a tumbling five-minute window of a growing table using Hive.”
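For illustration only, here is a minimal sketch of the kind of tumbling five-minute window aggregation the customer was asking for — not the Hive or Flink API, just plain Python over assumed (timestamp-in-seconds, key) event pairs, to make the windowing semantics concrete:

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # tumbling five-minute window


def tumbling_window_counts(events):
    """Aggregate (timestamp, key) events into non-overlapping
    five-minute windows, keyed by the window's start time."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = ts - (ts % WINDOW_SECONDS)  # align to window boundary
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in windows.items()}


# Events at t=10s and t=200s fall in window [0, 300); t=400s in [300, 600).
events = [(10, "click"), (200, "click"), (400, "view")]
print(tumbling_window_counts(events))
# {0: {'click': 2}, 300: {'view': 1}}
```

A batch engine like Hive would have to re-scan the growing table on a schedule to keep this up to date; a stream processor maintains the window state incrementally as each event arrives.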
“the customer and business user really need: a representation of data as a stream and the ability to do in-stream complex/stateful analytics. ”
“Customers and end-users wrangle with the latency gap in all kinds of interesting and expensive ways.”
“it’s refreshing to be given constructs of stream, state, time and snapshots as the building blocks of event processing rather than incomplete concepts of keys, values, and execution phases.”
“The first approach is to use batch as a starting point then try to build streaming on top of batch. This likely won’t meet strict latency requirements, though, because micro-batching to simulate streaming requires some fixed overhead–hence the proportion of the overhead increases as you try to reduce latency.”
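The overhead argument can be made concrete with simple arithmetic. Assuming an illustrative fixed per-batch cost of 0.1 s (scheduling, task launch — the exact value is a made-up assumption), the fraction of each micro-batch cycle lost to overhead grows as the batch interval shrinks:

```python
def overhead_fraction(batch_interval_s, fixed_overhead_s=0.1):
    """Fraction of each micro-batch cycle spent on fixed
    per-batch overhead rather than useful processing."""
    return fixed_overhead_s / (batch_interval_s + fixed_overhead_s)


# Shrinking the batch interval to chase lower latency lets the
# fixed overhead consume a growing share of every cycle.
for interval in (10.0, 1.0, 0.1):
    print(f"{interval:>5}s batches -> {overhead_fraction(interval):.0%} overhead")
```

At a 10 s interval the overhead is about 1% of the cycle; at 0.1 s it is 50%. This is why micro-batching cannot be pushed to arbitrarily low latency, whereas a true per-event stream processor pays no per-batch cost.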
“However we asked ourselves if the data is being generated in real-time, why must it not be processed downstream in real-time?”
“requirements around low latency processing and complex analysis cannot be met in an inexpensive, scalable and fault-tolerant way.”
Translated from: https://www.systutorials.com/do-big-data-stream-processing-in-the-stream-way/