How to combine two mappers to one reducer: answers

【Question title】: How to combine two mappers to one reducer
【Posted】: 2016-04-26 16:36:12
【Question】:

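I am using Hadoop to compare two files. I use two mappers, one per file, and a single reducer. The first mapper gets a plain text file, and the second mapper gets a file in which every line has this format: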

word 1 or -1

The input of the map is:

public void map(LongWritable key, Text value, Context context)

The output of the first mapper is:

key:word value:0

The output of the second mapper will be:

word 1 or -1

The input of the reducer is:

public void reduce(Text key, Iterable<IntWritable> values, Context context)

The output of the reducer is:

context.write(key, new IntWritable(sum));

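The results I get come from each mapper separately. I want the reducer to receive the values for the same key from both mappers and merge them into a single result. Here is the code.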

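import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CompareTwoFiles extends Configured implements Tool {
    static ArabicStemmer Stemmer = new ArabicStemmer();
    String ArabicWord = "";

    // First mapper: emits (word, 0) for every token of the plain text file.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

        int n = 0;
        private Text num = new Text();
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String token = "";

            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                token = tokenizer.nextToken();
                // Any result of stemWord is discarded, so the token is emitted unchanged.
                Stemmer.stemWord(token);
                word.set(token);
                context.write(word, new IntWritable(0));
            }
        }
    }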

    // Second mapper: reads lines of the form "word 1" or "word -1" and emits (word, 1) or (word, -1).
    public static class Map2 extends Mapper<LongWritable, Text, Text, IntWritable> {
        int n = 0;
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String token = "";

            // Pick the label carried by this line.
            if (line.contains("1") && !line.contains("-1")) {
                n = 1;
            } else if (line.contains("-1")) {
                n = -1;
            }
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                token = tokenizer.nextToken();
                // Skip the label token itself; every other token is a word.
                if (!token.equals("1") && !token.equals("-1")) {
                    word.set(token);
                    context.write(word, new IntWritable(n));
                }
            }
        }
    }

    // Reducer: sums all values for a key and writes only non-zero sums.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        Text sumT = new Text();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            int num = 0;
            int[] intArr = new int[2];
            boolean flag = false;
            int i = 0;

            for (IntWritable value : values) {
                sum += value.get();
            }

            if (sum != 0) {
                context.write(key, new IntWritable(sum));
            }
        }
    }
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new CompareTwoFiles(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:8020");
        conf.set("hadoop.job.ugi", "hdfs");
        Job job = new Job(conf);
        job.setJarByClass(CompareTwoFiles.class);
        job.setJobName("compare");
        job.setReducerClass(Reduce.class);
        job.setMapperClass(Map.class);
        job.setMapperClass(Map2.class);
        job.setCombinerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0]: plain text file, args[1]: "word 1 or -1" file, args[2]: output directory.
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, Map.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, Map2.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}

The result I get looks like this:

First map
w1 0
w2 0
Second map
w1 1
w2 3
w3 -1

【Comments】:

Use job.setNumReduceTasks(1); that way you have a single reducer, and all the data from both mappers goes into it.
Unfortunately, I tried that, but it did not work.
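
As background for the exchange above, here is a minimal sketch of why the reducer count alone may not change how keys meet. It assumes the job uses Hadoop's default HashPartitioner, whose formula is reproduced below; the class name PartitionSketch is hypothetical, and String.hashCode stands in for the hash of a Text key, so the concrete partition numbers will differ from a real job.

```java
class PartitionSketch {
    // Same formula as Hadoop's default HashPartitioner:
    // clear the sign bit of the hash, then take the remainder by the reducer count.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"w1", "w2", "w3"};
        for (int reducers : new int[] {1, 4}) {
            for (String key : keys) {
                // The partition depends only on the key, never on which mapper emitted it.
                System.out.println(reducers + " reducer(s): " + key
                        + " -> partition " + partition(key, reducers));
            }
        }
    }
}
```

With one reducer every key lands in partition 0; with more reducers, equal keys still share a partition, so records for the same word from either input are routed to the same reduce task in both configurations.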

Tags: java hadoop mapreduce key-value

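【Solution 1】:

The whole idea of MapReduce is that a Mapper emits a value for each key, in your case one value per word, and that each key is then handled by a Reducer (in your case, one reduce call should receive all the counts for one word). In other words, in the Mapper you write something like [key, value] for every word you encounter. A single job run can have only one Mapper class and one Reducer class registered through job.setMapperClass and job.setReducerClass.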
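
The key/value flow described above can be imitated without Hadoop. The following is a minimal sketch in plain Java, not the framework itself: the class TwoMappersOneReducerSketch and its methods are hypothetical, a TreeMap plays the role of the shuffle, and the two map methods follow the output formats given in the question (word with 0 for the text file, word with 1 or -1 for the second file).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TwoMappersOneReducerSketch {
    // Stand-in for the shuffle: each emitted value is appended to the list for its key.
    static void emit(Map<String, List<Integer>> shuffle, String key, int value) {
        shuffle.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // First mapper: every word of a plain-text line is emitted with value 0.
    static void mapText(String line, Map<String, List<Integer>> shuffle) {
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                emit(shuffle, token, 0);
            }
        }
    }

    // Second mapper: a line "word 1" or "word -1" is emitted as (word, 1) or (word, -1).
    static void mapLabels(String line, Map<String, List<Integer>> shuffle) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length == 2) {
            emit(shuffle, parts[0], Integer.parseInt(parts[1]));
        }
    }

    // Runs both mappers, then one reduce step per key; zero sums are skipped,
    // mirroring the sum != 0 check in the question's reducer.
    static Map<String, Integer> run(List<String> textLines, List<String> labelLines) {
        Map<String, List<Integer>> shuffle = new TreeMap<>();
        for (String line : textLines) {
            mapText(line, shuffle);
        }
        for (String line : labelLines) {
            mapLabels(line, shuffle);
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            if (sum != 0) {
                output.put(entry.getKey(), sum);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // Prints {w1=1, w2=-1, w3=-1}; w4 appears only in the text, sums to 0, and is skipped.
        System.out.println(run(Arrays.asList("w1 w2 w4"),
                Arrays.asList("w1 1", "w2 -1", "w3 -1")));
    }
}
```

The point of the sketch is that values for w1 coming from both inputs end up in one list and are summed in a single reduce step.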

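In your case, it sounds as if MapReduce is not a good fit for the problem. Comparing one file with another is not necessarily a problem that naturally gains efficiency from partitioning and parallelization. What you could do is partition the text file and send each text partition, together with the entire word 1 or -1 file, to every mapper. The Reducers would then compute a sum/value for each word.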
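
A rough sketch of that alternative, again in plain Java rather than Hadoop: each simulated mapper receives one partition of the text plus the whole label file as an in-memory table, and a final step adds up the partial results. The class ReplicatedLookupSketch and all method names are hypothetical, and treating unlabelled words as contributing nothing is an assumption. In a real job the small file would typically be shipped to the mappers through Hadoop's distributed cache, which this sketch does not use.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReplicatedLookupSketch {
    // Parse the whole "word 1 or -1" file into an in-memory lookup table.
    static Map<String, Integer> loadLabels(List<String> labelLines) {
        Map<String, Integer> labels = new HashMap<>();
        for (String line : labelLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                labels.put(parts[0], Integer.parseInt(parts[1]));
            }
        }
        return labels;
    }

    // One "mapper": processes a single text partition and looks every word up in the table.
    static Map<String, Integer> mapPartition(List<String> partition, Map<String, Integer> labels) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String line : partition) {
            for (String token : line.trim().split("\\s+")) {
                Integer label = labels.get(token);
                if (label != null) {
                    // Words without a label are ignored (an assumption of this sketch).
                    partial.merge(token, label, Integer::sum);
                }
            }
        }
        return partial;
    }

    // "Reducer": adds up the partial sums for each word.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, value) -> totals.merge(word, value, Integer::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Integer> labels = loadLabels(Arrays.asList("w1 1", "w2 -1"));
        Map<String, Integer> first = mapPartition(Arrays.asList("w1 w2 w1"), labels);
        Map<String, Integer> second = mapPartition(Arrays.asList("w1 w3"), labels);
        // Prints {w1=3, w2=-1}: w1 occurs three times with label 1, w2 once with label -1.
        System.out.println(reduce(Arrays.asList(first, second)));
    }
}
```

This only works when the label file is small enough to fit in each mapper's memory; the text file is the part that gets split.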

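You could also post your Mapper and Reducer classes here.

【Discussion】:

I have posted the code. What I mean is that I want the results of both mappers to be computed together in the reducer, rather than separately for each mapper. Is that possible?
Yes, the results of all mappers are always combined in the reducer. What MR does is collect all map outputs that share a key and hand the list of output values to one reduce call. Each job can therefore have only one Mapper class, but your code first registers Map.class and then Map2.class, so the first is overwritten by the second, i.e. only your Map2.class is used. Based on the output you posted, there should be one reduce call for w1, one for w2, and one for w3.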
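
The "last registration wins" behaviour mentioned above can be pictured with a toy model. This is not Hadoop code: MapperRegistrationSketch is a hypothetical class whose method names merely echo the driver in the question, with a single slot for the job-wide mapper and a separate table for per-path mappers. Note that the posted driver also calls MultipleInputs.addInputPath with a mapper per path; to my understanding that call installs a delegating mapper that dispatches by input path, so the per-path registrations are worth checking in addition to the setMapperClass calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MapperRegistrationSketch {
    // Single slot: a job-wide mapper setting holds one value at a time.
    private String jobMapper;
    // Per-path table: one mapper name per input path.
    private final Map<String, String> mapperByPath = new LinkedHashMap<>();

    void setMapperClass(String mapperName) {
        jobMapper = mapperName; // overwrites any earlier value
    }

    void addInputPath(String path, String mapperName) {
        mapperByPath.put(path, mapperName);
    }

    String getJobMapper() {
        return jobMapper;
    }

    Map<String, String> getMapperByPath() {
        return mapperByPath;
    }

    public static void main(String[] args) {
        MapperRegistrationSketch job = new MapperRegistrationSketch();
        job.setMapperClass("Map");
        job.setMapperClass("Map2");
        job.addInputPath("args[0]", "Map");
        job.addInputPath("args[1]", "Map2");
        System.out.println(job.getJobMapper());    // prints Map2
        System.out.println(job.getMapperByPath()); // prints {args[0]=Map, args[1]=Map2}
    }
}
```

In the model, the second setMapperClass call replaces the first, while the per-path table keeps both entries.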