【Posted】: 2017-06-19 09:43:59
【Question】:
According to several sources, for example Hadoop Application Architecture, multiple sinks can read from the same channel to increase throughput:
A sink can only fetch data from a single channel, but many sinks can fetch data from that same channel. A sink runs in a single thread, which has huge limitations on a single sink—for example, throughput to disk. Assume with HDFS you get 30 MBps to a single disk; if you only have one sink writing to HDFS then all you’re going to get is 30 MBps throughput with that sink. More sinks consuming from the same channel will resolve this bottleneck. The limitation with more sinks should be the network or the CPU. Unless you have a really small cluster, HDFS should never be your bottleneck.
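For reference, the setup described in that quote could be sketched as the Flume agent configuration fragment below. The agent name `a1`, channel name `c1`, sink names `k1` and `k2`, and the HDFS path are hypothetical placeholders, and the source definition is omitted. The distinct `hdfs.filePrefix` values are an assumption on my part, so that the two sinks do not try to write files with the same name.

```properties
# Hypothetical agent "a1": two independent HDFS sinks draining one channel.
# (Source definition omitted; all names and paths are placeholders.)
a1.channels = c1
a1.sinks = k1 k2

a1.channels.c1.type = memory

# Both sinks name the same channel. No sinkgroups entry is declared,
# so each sink runs on its own thread.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.hdfs.filePrefix = events-k1

a1.sinks.k2.type = hdfs
a1.sinks.k2.channel = c1
a1.sinks.k2.hdfs.path = /flume/events
a1.sinks.k2.hdfs.filePrefix = events-k2
```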
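Beyond that, however, there is also the concept of a sink group with a load balancing sink processor. According to the article, there is no need to create a sink group in order to consume events faster: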
It is important to understand that all sinks within a sink group are not active at the same time; only one of them is sending data at any point in time. Therefore, sink groups should not be used to clear off the channel faster—in this case, multiple sinks should simply be set to operate by themselves with no sink group, and they should be configured to read from the same channel
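For comparison, a sink group with a load balancing processor would look roughly like the fragment below. Again, `a1`, `c1`, `g1`, `k1`, `k2`, and the collector hostnames and port are hypothetical. I am using Avro sinks that point at two downstream agents only as an illustration, not as the setup the article has in mind.

```properties
# Hypothetical agent "a1": two Avro sinks placed in one sink group.
# (Source definition omitted; hostnames and port are placeholders.)
a1.channels = c1
a1.sinks = k1 k2
a1.sinkgroups = g1

a1.channels.c1.type = memory

# The group's processor picks one sink per delivery attempt, so the
# sinks take turns instead of draining the channel in parallel.
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.backoff = true

a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = collector1.example.com
a1.sinks.k1.port = 4545

a1.sinks.k2.type = avro
a1.sinks.k2.channel = c1
a1.sinks.k2.hostname = collector2.example.com
a1.sinks.k2.port = 4545
```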
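So I really don't understand when I should use a sink group with a load balancing processor, and when I should simply add more sinks that read from one particular channel.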
【Discussion】: