【发布时间】:2014-11-08 20:19:29
【问题描述】:
目标:
- 我希望能够指定输入文件中使用的映射器数量
- 同样,我想指定每个映射器将占用的文件行数
简单示例:
对于 10 行的输入文件(长度不等;下面的示例),我希望有 2 个映射器——因此每个映射器将处理 5 行。
This is
an arbitrary example file
of 10 lines.
Each line does
not have to be
of
the same
length or contain
the same
number of words
这是我所拥有的:
(我有它以便每个映射器产生一个“
package org.myorg;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.InputFormat;
public class Test {
// prduce one "<map,1>" pair per mapper
public static class Map extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
context.write(new Text("map"), one);
}
}
// reduce by taking a sum
public static class Red extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job1 = Job.getInstance(conf, "pass01");
job1.setJarByClass(Test.class);
job1.setMapperClass(Map.class);
job1.setCombinerClass(Red.class);
job1.setReducerClass(Red.class);
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job1, new Path(args[0]));
FileOutputFormat.setOutputPath(job1, new Path(args[1]));
// // Attempt#1
// conf.setInt("mapreduce.input.lineinputformat.linespermap", 5);
// job1.setInputFormatClass(NLineInputFormat.class);
// // Attempt#2
// NLineInputFormat.setNumLinesPerSplit(job1, 5);
// job1.setInputFormatClass(NLineInputFormat.class);
// // Attempt#3
// conf.setInt(NLineInputFormat.LINES_PER_MAP, 5);
// job1.setInputFormatClass(NLineInputFormat.class);
// // Attempt#4
// conf.setInt("mapreduce.input.fileinputformat.split.minsize", 234);
// conf.setInt("mapreduce.input.fileinputformat.split.maxsize", 234);
System.exit(job1.waitForCompletion(true) ? 0 : 1);
}
}
上面的代码,使用上面的示例数据,会产生
map 10
我希望输出是
map 2
第一个映射器将对前 5 行执行某些操作,而第二个映射器将对后 5 行执行某些操作。
【问题讨论】:
标签: java hadoop input split mapreduce