MapReduce：如何让映射器处理多行？答案

【问题标题】：MapReduce: How to get mapper to process multiple lines?MapReduce：如何让映射器处理多行？
【发布时间】：2014-11-08 20:19:29
【问题描述】：

目标：

我希望能够指定输入文件中使用的映射器数量
同样，我想指定每个映射器将占用的文件行数

简单示例：

对于 10 行的输入文件（长度不等；下面的示例），我希望有 2 个映射器——因此每个映射器将处理 5 行。

This is
an arbitrary example file
of 10 lines.
Each line does
not have to be
of
the same
length or contain
the same
number of words

这是我所拥有的：

（我有它以便每个映射器产生一个“”键值对......然后它将在reducer中求和）

package org.myorg;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.InputFormat;


public class Test {

  // prduce one "<map,1>" pair per mapper
  public static class Map extends Mapper<Object, Text, Text, IntWritable>{
    private final static IntWritable one = new IntWritable(1);
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      context.write(new Text("map"), one);
    }
  }

  // reduce by taking a sum
  public static class Red extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {      
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }


  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job1 = Job.getInstance(conf, "pass01");

    job1.setJarByClass(Test.class);
    job1.setMapperClass(Map.class);
    job1.setCombinerClass(Red.class);
    job1.setReducerClass(Red.class);

    job1.setOutputKeyClass(Text.class);
    job1.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job1, new Path(args[0]));
    FileOutputFormat.setOutputPath(job1, new Path(args[1]));

    // // Attempt#1
    // conf.setInt("mapreduce.input.lineinputformat.linespermap", 5);
    // job1.setInputFormatClass(NLineInputFormat.class);

    // // Attempt#2
    // NLineInputFormat.setNumLinesPerSplit(job1, 5);
    // job1.setInputFormatClass(NLineInputFormat.class);

    // // Attempt#3
    // conf.setInt(NLineInputFormat.LINES_PER_MAP, 5);
    // job1.setInputFormatClass(NLineInputFormat.class);

    // // Attempt#4
    // conf.setInt("mapreduce.input.fileinputformat.split.minsize", 234);
    // conf.setInt("mapreduce.input.fileinputformat.split.maxsize", 234);


    System.exit(job1.waitForCompletion(true) ? 0 : 1);
  }
}

上面的代码，使用上面的示例数据，会产生

map 10

我希望输出是

map 2

第一个映射器将对前 5 行执行某些操作，而第二个映射器将对后 5 行执行某些操作。

【问题讨论】：

标签： java hadoop input split mapreduce

【解决方案1】：

你可以使用NLineInputFormat。

使用NLineInputFormat 功能，您可以准确指定映射器应该有多少行。例如。如果您的文件有 500 行，并且您将每个映射器的行数设置为 10，那么您有 50 个映射器（而不是一个 - 假设文件小于 HDFS 块大小）。

编辑：

这里是一个使用 NLineInputFormat 的例子：

映射器类：

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapperNLine extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {

        context.write(key, value);
    }

}

驱动类：

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Driver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {

        if (args.length != 2) {
            System.out
                  .printf("Two parameters are required for DriverNLineInputFormat- <input dir> <output dir>\n");
            return -1;
        }

        Job job = new Job(getConf());
        job.setJobName("NLineInputFormat example");
        job.setJarByClass(Driver.class);

        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 5);

        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MapperNLine.class);
        job.setNumReduceTasks(0);

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
        System.exit(exitCode);
    }
}

使用您提供的输入，上述示例 Mapper 的输出将被写入两个文件，因为 2 个 Mapper 被初始化：

part-m-00001

0   This is
8   an arbitrary example file
34  of 10 lines.
47  Each line does
62  not have to be

part-m-00002

77  of
80  the same
89  length or contain
107 the same
116 number of words

【讨论】：

如何正确使用“NLineInputFormat”？例子？这像我的“尝试#2”吗？（我的java不是很厉害）
@csiu 添加了一个关于如何使用NLineInputFormat 的示例，映射器非常天真，只是打印出内容。