Combiners, Reducers and EcoSystemProject in Hadoop: answers

【Question title】: Combiners, Reducers and EcoSystemProject in Hadoop
【Posted】: 2014-11-23 18:11:49
【Question description】:

What do you think the answer to question 4 on the site should be?

Are the given answers right or wrong?

Question: 4

In the standard word count MapReduce algorithm, why might using a combiner reduce the overall job running time?

A. Because combiners perform local aggregation of word counts, thereby allowing the mappers to process input data faster.
B. Because combiners perform local aggregation of word counts, thereby reducing the number of mappers that need to run.
C. Because combiners perform local aggregation of word counts, and then transfer that data to reducers without writing the intermediate data to disk.
D. Because combiners perform local aggregation of word counts, thereby reducing the number of key-value pairs that need to be shuffled across the network to the reducers.

Answer: A

and

Question: 3

What happens in a MapReduce job when you set the number of reducers to one?

A. A single reducer gathers and processes all the output from all the mappers. The output is written in as many separate files as there are mappers.
B. A single reducer gathers and processes all the output from all the mappers. The output is written to a single file in HDFS.
C. Setting the number of reducers to one creates a processing bottleneck, and since the number of reducers as specified by the programmer is used as a reference value only, the MapReduce runtime provides a default setting for the number of reducers.
D. Setting the number of reducers to one is invalid, and an exception is thrown.
Answer: A

My answers, based on my understanding of the questions above:

Question 4: D
Question 3: B

Update

You have user profile records in your OLTP database that you want to join with weblogs you have already ingested into HDFS. How will you obtain these user records?
Options
A. HDFS commands
B. Pig load
C. Sqoop import
D. Hive
Answer: B

For the updated question, I am unsure whether the answer is B or C.

Edit

Correct answer: Sqoop.

【Question comments】:

+1 for pointing this out to anyone who was thinking of spending money there...
Please see the update as well.

Tags: hadoop mapreduce reducers combiners

【Solution 1】:

As far as I understand, both of the given answers are wrong.

I have not worked much with combiners, but everywhere I have seen one, it operates on the output of the Mapper. The answer to question 4 should be D.
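To make option D concrete, here is a minimal pure-Python sketch of word count with and without a combiner. It is not Hadoop code: `map_words`, `combine`, `reduce_counts` and the two sample splits are hypothetical names and data chosen for illustration, and the "shuffle" is simply a list of pairs.

```python
from collections import Counter

def map_words(split):
    """Mapper: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]

def combine(pairs):
    """Combiner: sum the counts per word locally, on the mapper's side."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())

def reduce_counts(pairs):
    """Reducer: sum every count that arrives for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Two hypothetical input splits, one per mapper.
splits = ["the cat and the hat", "the dog and the cat"]
map_outputs = [map_words(s) for s in splits]

# Pairs that would cross the network in each case.
without_combiner = [p for out in map_outputs for p in out]
with_combiner = [p for out in map_outputs for p in combine(out)]

print(len(without_combiner), "pairs shuffled without a combiner")
print(len(with_combiner), "pairs shuffled with a combiner")

# The final word counts are identical either way.
assert reduce_counts(without_combiner) == reduce_counts(with_combiner)
```

The mappers do the same work in both runs; only the number of pairs handed to the reducers shrinks, which is the effect option D describes.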

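Again from practical experience, I have found that the number of output files always equals the number of Reducers, so the answer to question 3 should be B. This may not hold when MultipleOutputs is used, but that is not common.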
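The relationship between reducers and output files can be sketched in plain Python. This is a simulation under assumptions, not Hadoop itself: `partition` uses `zlib.crc32` as a stand-in for Hadoop's hash partitioning, `run_reduce_phase` is a hypothetical helper, and the `part-r-NNNNN` names imitate the file naming of the newer MapReduce API.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick a reducer for a key (crc32 is a stable stand-in for a hash partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_reduce_phase(pairs, num_reducers):
    """Route each pair to one reducer; every reducer owns exactly one output file."""
    files = {f"part-r-{i:05d}": defaultdict(int) for i in range(num_reducers)}
    for word, count in pairs:
        name = f"part-r-{partition(word, num_reducers):05d}"
        files[name][word] += count
    return {name: dict(counts) for name, counts in files.items()}

# Hypothetical combined word counts arriving from the mappers.
pairs = [("the", 4), ("cat", 2), ("and", 2), ("hat", 1), ("dog", 1)]

one = run_reduce_phase(pairs, 1)
three = run_reduce_phase(pairs, 3)

print(sorted(one))    # a single output file
print(sorted(three))  # one output file per reducer
```

With one reducer, every key lands in the same file regardless of how many mappers produced the pairs, which matches option B rather than option A.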

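Finally, I don't think Apache would lie about MapReduce (exceptions do occur :). The answers to both questions can be found on their wiki page. Take a look.

By the way, I like the "100% Pass Guarantee or Money Back!!!" quote on the link you provided ;-)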

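Edit
Since I know very little about Pig and Sqoop, I am not sure about the question in the update section. The same thing can certainly be achieved with Hive, though, by creating external tables over the HDFS data and then joining them.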

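Update
After the comments from user milk3422 and the question owner, I did some searching and found that my assumption that Hive was the answer to the last question was wrong, because a separate OLTP database is involved. The correct answer should be C, because Sqoop is designed to transfer data between HDFS and relational databases.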
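As a rough illustration of what a Sqoop import looks like, the sketch below only assembles the command line in Python and prints it; it does not run Sqoop. The JDBC URL, user name, table name and target directory are made-up placeholders, and `sqoop_import_command` is a hypothetical helper. The flags shown (`--connect`, `--username`, `-P`, `--table`, `--target-dir`) are standard Sqoop 1 import arguments.

```python
import shlex

def sqoop_import_command(jdbc_url, username, table, target_dir):
    """Build the argument list for a basic `sqoop import` of one table."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,       # JDBC URL of the OLTP database
        "--username", username,
        "-P",                        # prompt for the password on the console
        "--table", table,            # source table to copy
        "--target-dir", target_dir,  # HDFS directory that receives the rows
    ]

# Placeholder values, assumed for illustration only.
cmd = sqoop_import_command(
    "jdbc:mysql://dbhost/shop", "etl_user", "user_profiles", "/data/user_profiles"
)
print(shlex.join(cmd))
```

Once the table is in HDFS, it can be joined with the weblogs using Pig, Hive or a plain MapReduce job.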

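【Comments】:

+1 for both answers. I guess a lot of people should ask for their money back...
Yes, I chose the same answers too.
For question 4: in MapReduce, combiners run after the Mapper and before the data is sent to the Reducer. A combiner performs local aggregation to minimize the amount of data sent to the Reducer, so D is the correct answer. For question 3: you get as many output files as there are reducers, so answer A is incorrect and answer B is correct.
@milk3422: About the updated question.. why can't it be Sqoop (importing data from another database)? I know very little about the other ecosystem projects.
Sorry, I did not address the third question. The best answer to it is C, Sqoop import. Sqoop is used to import and export data between HDFS and relational databases. I have used Pig many times, and Pig is designed for loading data that is already in HDFS, not for reading from a relational database.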

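【Solution 2】:

The given answers to questions 4 and 3 seem correct to me. For question 4 this is quite reasonable, because when a combiner is used, the map output is kept in a collection and processed first, and the buffer is then flushed when it is full. To support this, I will add this link: http://wiki.apache.org/hadoop/HadoopMapReduce

It clearly explains why a combiner speeds up the process.

I also think the given answer to q.3 is correct, since that is generally the basic default configuration. To support this, I will add another informative link: https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/mapreduce-types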

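【Comments】:

@hunter30: I don't think the given answers to 4 and 3 are correct. Since we set only 1 reducer, all of the final output goes to that 1 reducer, so the output will be a single file; that is, the number of output files equals the number of reducers. For question 4: the combiner runs between map and reduce, so it plays no role in speeding up the mapper itself.
Actually, you can also write to multiple output files from a single reducer. Take a look at the Hadoop API: hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/…
Also, a combiner reduces the read and write operations to hdfs between the map and reduce phases. Some built-in Java examples (such as wordcount) directly use the same class for both the combiner and the reducer.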
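Reusing the reducer as the combiner works for word count because summation is associative and commutative. The sketch below, with hypothetical helper names and made-up values, contrasts a sum with a mean to show when that reuse is safe; it is a plain-Python illustration, not Hadoop code.

```python
def sum_reduce(values):
    """Reducer for counts: safe to reuse as a combiner."""
    return sum(values)

def mean_reduce(values):
    """Reducer for averages: not safe to reuse as a combiner."""
    return sum(values) / len(values)

def run(values_per_mapper, reduce_fn, use_combiner):
    """Apply reduce_fn once per mapper (as a combiner) or not at all, then reduce."""
    if use_combiner:
        shuffled = [reduce_fn(vals) for vals in values_per_mapper]
    else:
        shuffled = [v for vals in values_per_mapper for v in vals]
    return reduce_fn(shuffled)

# Values for a single key, as emitted by two hypothetical mappers.
values_per_mapper = [[1, 2, 3], [10]]

print(run(values_per_mapper, sum_reduce, use_combiner=False))   # 16
print(run(values_per_mapper, sum_reduce, use_combiner=True))    # 16
print(run(values_per_mapper, mean_reduce, use_combiner=False))  # 4.0
print(run(values_per_mapper, mean_reduce, use_combiner=True))   # 6.0
```

The sum is unchanged by the combiner, while the mean of per-mapper means differs from the true mean. Since Hadoop may run a combiner zero, one or several times, a job's result must not depend on whether the combiner runs.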