Hadoop (Java): changing the type of Mapper output values

【Question title】: Hadoop (java) change the type of Mapper output values
【Posted】: 2016-03-03 03:02:28
【Question】:

I am writing a mapper function that emits a user_id as the key and a value that is also of type Text. This is how I do it:

public static class UserMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text userid = new Text();
    private Text catid = new Text();

    /* map method */
    public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString(), ","); /* separated by "," */
        int count = 0;

        userid.set(itr.nextToken());

        while (itr.hasMoreTokens()) {
            if (++count == 3) {
                catid.set(itr.nextToken());
                context.write(userid, catid);
            }else {
                itr.nextToken();
            }
        }
    }
}
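
To make it easier to see which field the loop above emits, here is a minimal plain-Java sketch of the same tokenizing logic without any Hadoop types. The class name `FieldPicker`, the method `pick`, and the sample line are made up for illustration; unlike the mapper, the sketch returns null for short or empty lines instead of throwing.

```java
import java.util.StringTokenizer;

public class FieldPicker {
    /**
     * Mirrors the mapper's loop: the first token is the user id, and the
     * third token after it (the fourth field overall) is the category id.
     * Returns {userId, categoryId}, or null if the line has too few fields.
     */
    public static String[] pick(String line) {
        StringTokenizer itr = new StringTokenizer(line, ",");
        if (!itr.hasMoreTokens()) {
            return null; // the original mapper would throw NoSuchElementException here
        }
        String userId = itr.nextToken();
        int count = 0;
        while (itr.hasMoreTokens()) {
            String token = itr.nextToken();
            if (++count == 3) {
                return new String[] { userId, token };
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical input line: user, date, price, category, extra
        String[] pair = pick("u42,2016-03-01,9.99,books,extra");
        System.out.println(pair[0] + " -> " + pair[1]); // prints "u42 -> books"
    }
}
```

Note that StringTokenizer skips empty fields, so a line such as `u42,,9.99,books` would shift the field positions; `String.split(",")` may be a safer choice if empty columns can occur.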

Then, in the main program, I set the mapper's output classes as follows:

    Job job = new Job(conf, "Customer Analyzer");
    job.setJarByClass(popularCategories.class);
    job.setMapperClass(UserMapper.class);
    job.setCombinerClass(UserReducer.class);
    job.setReducerClass(UserReducer.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);

So even though I set the output value class to Text.class, I still get the following error at compile time:

popularCategories.java:39: write(org.apache.hadoop.io.Text,org.apache.hadoop.io.IntWritable)
 in org.apache.hadoop.mapreduce.TaskInputOutputContext<java.lang.Object,
 org.apache.hadoop.io.Text,org.apache.hadoop.io.Text,
 org.apache.hadoop.io.IntWritable> 
 cannot be applied to (org.apache.hadoop.io.Text,org.apache.hadoop.io.Text)
 context.write(userid, catid);
                           ^

According to this error, the compiler still treats the mapper class as having this signature: write(org.apache.hadoop.io.Text,org.apache.hadoop.io.IntWritable)

When I change the class definition as follows, the problem goes away.

 public static class UserMapper extends Mapper<Object, Text, Text, Text> {

 }

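So I would like to understand the difference between the type parameters in the class definition and setting the mapper output value class in the driver.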


Tags: java apache hadoop types mapreduce

【Solution 1】:

In your mapper class definition, you declared the output value class as IntWritable.

public static class UserMapper extends Mapper<Object, Text, Text, IntWritable>

However, inside the mapper class you instantiate catid as Text.

private Text catid = new Text();

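Even though you have set MapOutputValueClass to Text, you still need to change the mapper class definition so that it agrees with the key and value output classes set in the driver.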
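
The compile error can be reproduced without Hadoop. The following is a minimal sketch in which `Ctx` and `MiniMapper` are hypothetical stand-ins for Hadoop's `Context` and `Mapper`, with `String` and `Integer` playing the roles of `Text` and `IntWritable`. It is meant only to show that the type parameters of the class definition determine what `write` accepts at compile time, regardless of any driver setting.

```java
import java.util.ArrayList;
import java.util.List;

public class GenericWriteDemo {
    /** Hypothetical stand-in for Hadoop's Context: write() accepts only (K, V). */
    public static class Ctx<K, V> {
        public final List<String> out = new ArrayList<>();

        public void write(K key, V value) {
            out.add(key + "=" + value);
        }
    }

    /** Hypothetical stand-in for Mapper: its type parameters fix the context type. */
    public abstract static class MiniMapper<KEYOUT, VALUEOUT> {
        public abstract void map(String line, Ctx<KEYOUT, VALUEOUT> context);
    }

    /** Declared with an Integer value type, like Mapper<Object, Text, Text, IntWritable>. */
    public static class CountMapper extends MiniMapper<String, Integer> {
        @Override
        public void map(String line, Ctx<String, Integer> context) {
            context.write(line, 1);
            // context.write(line, "books"); // rejected by the compiler: String is not Integer
        }
    }

    /** Declared with a String value type, like the corrected Mapper<Object, Text, Text, Text>. */
    public static class CategoryMapper extends MiniMapper<String, String> {
        @Override
        public void map(String line, Ctx<String, String> context) {
            context.write(line, "books");
        }
    }

    public static void main(String[] args) {
        Ctx<String, String> ctx = new Ctx<>();
        new CategoryMapper().map("u42", ctx);
        System.out.println(ctx.out); // prints "[u42=books]"
    }
}
```

Uncommenting the marked line in `CountMapper` produces an error of the same kind as the one in the question, because the compiler only looks at the declared type parameters.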


【Solution 2】:

From the Apache documentation page:

Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

java.lang.Object
org.apache.hadoop.mapreduce.Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

where

KEYIN = Mapper input key (the byte offset of the line when the default TextInputFormat is used)
VALUEIN = Mapper input value (the content of the line in the record)
KEYOUT = Mapper output key (output of Mapper, input to Reducer)
VALUEOUT = Mapper output value (output of Mapper, input to Reducer)
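
As a rough illustration of how the four type parameters line up, here is a small in-memory sketch. `MiniMapper`, `MiniReducer`, and `run` are hypothetical stand-ins, not Hadoop classes, and the grouping step is a heavily simplified substitute for the real shuffle. The point is only that the mapper's KEYOUT and VALUEOUT must be the reducer's KEYIN and VALUEIN, which the signature of `run` enforces at compile time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

public class TypeFlowDemo {
    /** Hypothetical stand-in for Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>. */
    public interface MiniMapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        void map(KEYIN key, VALUEIN value, BiConsumer<KEYOUT, VALUEOUT> emit);
    }

    /** Hypothetical stand-in for Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>. */
    public interface MiniReducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        void reduce(KEYIN key, List<VALUEIN> values, BiConsumer<KEYOUT, VALUEOUT> emit);
    }

    /** The mapper's output types (K2, V2) are exactly the reducer's input types. */
    public static <K1, V1, K2, V2, K3, V3> Map<K3, V3> run(
            Map<K1, V1> input,
            MiniMapper<K1, V1, K2, V2> mapper,
            MiniReducer<K2, V2, K3, V3> reducer) {
        // Group map output by key; keys must be Comparable because a TreeMap is used.
        Map<K2, List<V2>> grouped = new TreeMap<>();
        input.forEach((k, v) -> mapper.map(k, v,
                (k2, v2) -> grouped.computeIfAbsent(k2, x -> new ArrayList<>()).add(v2)));
        Map<K3, V3> result = new TreeMap<>();
        grouped.forEach((k2, values) -> reducer.reduce(k2, values, result::put));
        return result;
    }

    /** Counts emitted category values per user; input keys play the role of line offsets. */
    public static Map<String, Integer> categoriesPerUser(Map<Long, String> lines) {
        MiniMapper<Long, String, String, String> mapper = (offset, line, emit) -> {
            String[] fields = line.split(",");
            emit.accept(fields[0], fields[3]); // user id -> category id
        };
        MiniReducer<String, String, String, Integer> reducer =
                (user, categories, emit) -> emit.accept(user, categories.size());
        return run(lines, mapper, reducer);
    }

    public static void main(String[] args) {
        Map<Long, String> lines = new LinkedHashMap<>();
        lines.put(0L, "u1,d,p,books");
        lines.put(13L, "u1,d,p,music");
        lines.put(26L, "u2,d,p,books");
        System.out.println(categoriesPerUser(lines)); // prints "{u1=2, u2=1}"
    }
}
```

If the reducer were declared as `MiniReducer<String, Integer, String, Integer>` while the mapper still emitted `String` values, the call to `run` would fail to compile, which mirrors the mismatch discussed in this thread.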

Your problem was solved once you corrected the Mapper output value type in the definition from

public static class UserMapper extends Mapper<Object, Text, Text, IntWritable> {

to

public static class UserMapper extends Mapper<Object, Text, Text, Text> {

I also found this article helpful for understanding these concepts clearly.


【Solution 3】:

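The class definition declares both the input and the output types. For example, your Mapper receives Object,Text and emits Text,Text. In your driver class you set the expected Mapper output key and value classes to Text, so the Hadoop framework expects your Mapper class definition to declare those output types, and your class to emit Text keys and values when it calls context.write(Text,Text). The compiler checks context.write against the type parameters in the class definition, whereas the driver settings are only read by the framework at run time, so the two must agree.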
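
To illustrate the run-time side, here is a minimal sketch of the kind of check a framework can make using the classes configured in the driver. `Collector` is a hypothetical stand-in and the exception type and message are invented for this example; Hadoop performs a comparable comparison between the emitted object's class and the configured map output classes, but its exact code is not reproduced here. `String` and `Integer` again stand in for `Text` and `IntWritable`.

```java
import java.util.ArrayList;
import java.util.List;

public class RuntimeTypeCheckDemo {
    /** Hypothetical stand-in for the framework's map-output collector. */
    public static class Collector {
        private final Class<?> keyClass;   // plays the role of setMapOutputKeyClass(...)
        private final Class<?> valueClass; // plays the role of setMapOutputValueClass(...)
        public final List<String> records = new ArrayList<>();

        public Collector(Class<?> keyClass, Class<?> valueClass) {
            this.keyClass = keyClass;
            this.valueClass = valueClass;
        }

        /** Accepts a record only if its run-time classes match the configured ones. */
        public void collect(Object key, Object value) {
            if (key.getClass() != keyClass) {
                throw new IllegalStateException("Type mismatch in key: expected "
                        + keyClass.getName() + ", received " + key.getClass().getName());
            }
            if (value.getClass() != valueClass) {
                throw new IllegalStateException("Type mismatch in value: expected "
                        + valueClass.getName() + ", received " + value.getClass().getName());
            }
            records.add(key + "\t" + value);
        }
    }

    public static void main(String[] args) {
        // The "driver" configures String keys and String values.
        Collector collector = new Collector(String.class, String.class);
        collector.collect("u42", "books"); // matches the configuration
        try {
            collector.collect("u42", 1); // an Integer value does not match
        } catch (IllegalStateException e) {
            // prints "Type mismatch in value: expected java.lang.String, received java.lang.Integer"
            System.out.println(e.getMessage());
        }
    }
}
```

In this picture, a mapper declared as `Mapper<Object, Text, Text, IntWritable>` that really emitted IntWritable values would compile, but a driver configured with `Text.class` as the map output value class would be expected to reject those values when the job runs.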
