【Question Title】: MapReduce: How to get mapper to process multiple lines?
【Posted】: 2014-11-08 20:19:29
【Question】:

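Goal:

  • I want to be able to specify the number of mappers used on an input file
  • Equivalently, I want to specify the number of lines of the file that each mapper will take

Simple example:

For an input file of 10 lines (of unequal length; example below), I want 2 mappers, so that each mapper processes 5 lines.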

This is
an arbitrary example file
of 10 lines.
Each line does
not have to be
of
the same
length or contain
the same
number of words

Here is what I have:

(I have it so that each mapper produces one "<map,1>" key-value pair... which will then be summed in the reducer)

package org.myorg;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.InputFormat;


public class Test {

  // produce one "<map,1>" pair per mapper
  public static class Map extends Mapper<Object, Text, Text, IntWritable>{
    private final static IntWritable one = new IntWritable(1);
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      context.write(new Text("map"), one);
    }
  }

  // reduce by taking a sum
  public static class Red extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {      
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }


  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job1 = Job.getInstance(conf, "pass01");

    job1.setJarByClass(Test.class);
    job1.setMapperClass(Map.class);
    job1.setCombinerClass(Red.class);
    job1.setReducerClass(Red.class);

    job1.setOutputKeyClass(Text.class);
    job1.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job1, new Path(args[0]));
    FileOutputFormat.setOutputPath(job1, new Path(args[1]));

    // // Attempt#1
    // conf.setInt("mapreduce.input.lineinputformat.linespermap", 5);
    // job1.setInputFormatClass(NLineInputFormat.class);

    // // Attempt#2
    // NLineInputFormat.setNumLinesPerSplit(job1, 5);
    // job1.setInputFormatClass(NLineInputFormat.class);

    // // Attempt#3
    // conf.setInt(NLineInputFormat.LINES_PER_MAP, 5);
    // job1.setInputFormatClass(NLineInputFormat.class);

    // // Attempt#4
    // conf.setInt("mapreduce.input.fileinputformat.split.minsize", 234);
    // conf.setInt("mapreduce.input.fileinputformat.split.maxsize", 234);


    System.exit(job1.waitForCompletion(true) ? 0 : 1);
  }
}

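The code above, run on the sample data above, produces

map 10

I want the output to be

map 2

where the first mapper does something with the first 5 lines and the second mapper does something with the last 5 lines.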

【Comments】:

    Tags: java hadoop input split mapreduce


    【Solution 1】:

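    You can use NLineInputFormat.

    With NLineInputFormat you can specify exactly how many lines each mapper should receive. E.g. if your file has 500 lines and you set the number of lines per mapper to 10, you get 50 mappers (instead of one, assuming the file is smaller than an HDFS block).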
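
To make the arithmetic concrete, here is a minimal plain-Java sketch of how N lines per split translates into a number of map tasks. It is a simplified model rather than Hadoop's implementation: the class and method names are hypothetical, and it assumes UTF-8 text with one `\n` terminator per line.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of line-based splitting; it does not use Hadoop.
public class NLineSplitSketch {

    // Number of splits (one map task each) when every split holds at most
    // linesPerSplit lines: ceiling division of totalLines by linesPerSplit.
    public static int countSplits(int totalLines, int linesPerSplit) {
        if (linesPerSplit <= 0) {
            throw new IllegalArgumentException("linesPerSplit must be positive");
        }
        return (totalLines + linesPerSplit - 1) / linesPerSplit;
    }

    // Byte offset of the first line of each split, assuming UTF-8 and a
    // single '\n' after every line (an assumption of this sketch).
    public static List<Long> firstLineOffsets(List<String> lines, int linesPerSplit) {
        List<Long> starts = new ArrayList<>();
        long offset = 0;
        for (int i = 0; i < lines.size(); i++) {
            if (i % linesPerSplit == 0) {
                starts.add(offset);
            }
            offset += lines.get(i).getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return starts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "This is", "an arbitrary example file", "of 10 lines.",
            "Each line does", "not have to be", "of", "the same",
            "length or contain", "the same", "number of words");

        System.out.println(countSplits(500, 10));          // 50
        System.out.println(countSplits(sample.size(), 5)); // 2
        System.out.println(firstLineOffsets(sample, 5));   // [0, 77]
    }
}
```

One thing that may explain why Attempt#1 and Attempt#3 in the question appear to have no effect: `Job.getInstance(conf, ...)` copies the `Configuration`, so values set on `conf` afterwards do not reach the job. Setting the value through `job1.getConfiguration()` or `NLineInputFormat.setNumLinesPerSplit(job1, 5)` (as in Attempt#2) avoids that.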

    Edit

    Here is an example that uses NLineInputFormat:

    Mapper class:

    import java.io.IOException;
    
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    
    public class MapperNLine extends Mapper<LongWritable, Text, LongWritable, Text> {
    
        @Override
        public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
    
            context.write(key, value);
        }
    
    }
    

    Driver class:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    
    public class Driver extends Configured implements Tool {
    
        @Override
        public int run(String[] args) throws Exception {
    
            if (args.length != 2) {
                // Usage errors go to stderr; the class is named Driver
                System.err.println("Usage: Driver <input dir> <output dir>");
                return -1;
            }
    
            // Job.getInstance replaces the deprecated Job constructor
            Job job = Job.getInstance(getConf());
            job.setJobName("NLineInputFormat example");
            job.setJarByClass(Driver.class);
    
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.addInputPath(job, new Path(args[0]));
            job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 5);
    
            LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
            job.setMapperClass(MapperNLine.class);
            job.setNumReduceTasks(0);
    
            boolean success = job.waitForCompletion(true);
            return success ? 0 : 1;
        }
    
        public static void main(String[] args) throws Exception {
            int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
            System.exit(exitCode);
        }
    }
    

    With the input you provided, the output of the sample Mapper above is written to two files, because 2 mappers are initialized:

    part-m-00001

    0   This is
    8   an arbitrary example file
    34  of 10 lines.
    47  Each line does
    62  not have to be
    

    part-m-00002

    77  of
    80  the same
    89  length or contain
    107 the same
    116 number of words
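
The example above yields two map tasks, but the mapper from the question would still report `map 10`, because `map()` runs once per input line rather than once per task. In Hadoop, `Mapper.cleanup(Context)` is called once at the end of each map task, so writing the `("map", 1)` pair there instead of in `map()` is one way to get a single pair per mapper. The difference can be sketched with a plain-Java model (no Hadoop involved; the class and method names are hypothetical):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the mapper lifecycle: each group of linesPerSplit
// lines stands for one map task.
public class EmitPerTaskSketch {

    // Models the question's mapper: one ("map", 1) pair per input line.
    public static int emitPerRecord(List<String> lines, int linesPerSplit) {
        int sum = 0;
        for (int start = 0; start < lines.size(); start += linesPerSplit) {
            int end = Math.min(start + linesPerSplit, lines.size());
            for (String line : lines.subList(start, end)) {
                sum += 1; // map() emits once for every line
            }
        }
        return sum;
    }

    // Models emitting in cleanup(): one ("map", 1) pair per map task.
    public static int emitPerTask(List<String> lines, int linesPerSplit) {
        int sum = 0;
        for (int start = 0; start < lines.size(); start += linesPerSplit) {
            int end = Math.min(start + linesPerSplit, lines.size());
            for (String line : lines.subList(start, end)) {
                // map() would process the line here but emit nothing
            }
            sum += 1; // cleanup() emits once, after the task's last line
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "This is", "an arbitrary example file", "of 10 lines.",
            "Each line does", "not have to be", "of", "the same",
            "length or contain", "the same", "number of words");

        System.out.println("map " + emitPerRecord(sample, 5)); // map 10
        System.out.println("map " + emitPerTask(sample, 5));   // map 2
    }
}
```

Summing the emitted ones in the reducer then gives the number of map tasks rather than the number of lines.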
    

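    【Comments】:

    • How do I use "NLineInputFormat" correctly? An example? Is it like my "Attempt#2"? (My Java is not very strong.)
    • @csiu Added an example of how to use NLineInputFormat; the mapper is very naive and just prints out the content.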