How to generate multiple file names at runtime in Hadoop? Answers

【Question title】: How to generate Multiple File names in runtime in HADOOP?
【Posted】: 2014-02-12 17:19:07
【Question】:

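I have some data in CSV format.

For example: K1,K2,data1,data2,data3

Here my mapper passes K1K2 to the reducer as the key, and data1,data2,data3 as the value.

I want to save this data in multiple files, each named K1K2 (or whatever key the reducer receives). If I use the MultipleOutputs class, I have to specify the file names before the mapper starts. But here I can only determine the keys after the mapper has read the data. What should I do?

PS: I am new to this.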

【Question comments】:

Tags: java hadoop mapreduce

【Solution 1】:

You can generate the file names yourself and pass them to MultipleOutputs in the reducer, like this:

private MultipleOutputs<Text, Text> out;

@Override
public void setup(Context context) {
   out = new MultipleOutputs<Text, Text>(context);
   ...
}

@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
  for (Text t : values) {
    // generateFileName is your own function; its result is used as the base output path
    out.write(key, t, generateFileName(<parameter list...>));
  }
}

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
  // close every output that MultipleOutputs opened
  out.close();
}

For more details, see the MultipleOutputs class reference: https://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
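The answer leaves generateFileName up to you. Below is a minimal sketch of one possibility, assuming the reducer key is a plain string such as K1K2. The class name OutputNames, the method body, and the choice to replace unsafe characters with '_' are all assumptions made for illustration; none of this is part of Hadoop.

```java
import java.util.Objects;

/** Hypothetical helper (not part of Hadoop): derives a base output path from a reducer key. */
final class OutputNames {
    private OutputNames() {}

    /**
     * Replaces every character other than letters, digits, '.', '_' and '-'
     * with '_' so that the key can be used as a file name.
     */
    static String generateFileName(String key) {
        Objects.requireNonNull(key, "key");
        String safe = key.trim().replaceAll("[^A-Za-z0-9._-]", "_");
        if (safe.isEmpty() || safe.equals("part")) {
            // Assumption: an empty name is useless, and "part" is avoided because
            // Hadoop uses it for the default output files.
            safe = "key_" + safe;
        }
        return safe;
    }
}
```

In the reducer this could be called as out.write(key, t, OutputNames.generateFileName(key.toString())).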

【Discussion】:

No, but it gives the error java.lang.IllegalArgumentException: Named output 'K1K2' not defined at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.checkNamedOutputName(MultipleOutputs.java:193)
If I add MultipleOutputs.addNamedOutput(job, FileName1.toString(), TextOutputFormat.class, NullWritable.class, Text.class); in the generateOutput() method, how do I get hold of the job in the reducer? I am just starting out, so this may be a very basic question.
You get the exception because you have to define the MultipleOutputs in the "driver" that runs the job.
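The exception most likely comes from a write call whose first argument is a named output (a String): such names are only accepted if they were registered in the driver beforehand. The following is a simplified stand-alone model of that rule, not Hadoop's actual code. The class NamedOutputRegistry and its methods are hypothetical; only the error message text is taken from the comment above.

```java
import java.util.HashSet;
import java.util.Set;

/** Simplified, hypothetical model of named-output registration; not Hadoop's implementation. */
final class NamedOutputRegistry {
    private final Set<String> defined = new HashSet<>();

    /** Models what the driver does via MultipleOutputs.addNamedOutput(job, name, ...). */
    void addNamedOutput(String name) {
        defined.add(name);
    }

    /** Models the check made when a task writes to a named output. */
    void checkDefined(String name) {
        if (!defined.contains(name)) {
            throw new IllegalArgumentException("Named output '" + name + "' not defined");
        }
    }
}
```

Because the keys are not known when the driver runs, they cannot be registered this way. The write(key, value, baseOutputPath) form shown in Solution 2 takes a path rather than a named output, so it does not go through this check.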

【Solution 2】:

You do not need to define the output file names in advance. You can use MultipleOutputs like this:

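import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Value is a placeholder for your own Writable value type
public class YourReducer extends Reducer<Text, Value, Text, Value> {
    private Value result = null;
    private MultipleOutputs<Text, Value> out;

    @Override
    public void setup(Context context) {
        out = new MultipleOutputs<Text, Value>(context);
    }

    @Override
    public void reduce(Text key, Iterable<Value> values, Context context)
            throws IOException, InterruptedException {
        // do your code (compute result from values)
        // the third argument is a base output path, relative to the job output directory
        out.write(key, result, "outputpath/" + key.toString());
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        out.close();
    }
}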

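This produces output under the following paths:

outputpath/K1
          /K2
          /K3
 .......

For this to work, you should use LazyOutputFormat.setOutputFormatClass() instead of FileOutputFormat. The job configuration also needs job.setOutputFormatClass(NullOutputFormat.class). Do not forget to provide the input and output paths as before, using FileInputFormat.setInputPaths() and FileOutputFormat.setOutputPath(). The generated files will then be placed relative to the specified output path.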
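To make the resulting layout concrete: to my understanding, Hadoop appends the task type and a zero-padded partition number to the base output path, so the reducer output for key K1 ends up in a file such as outputpath/K1-r-00000 under the job's output directory, rather than in a file literally named K1. A small sketch of that naming convention follows. OutputLayout and reducerFileName are hypothetical names, and the code only mimics the convention; it does not call Hadoop.

```java
import java.util.Locale;

/** Hypothetical illustration of the "<base>-r-<partition>" naming convention for reducer output. */
final class OutputLayout {
    private OutputLayout() {}

    /**
     * Builds the expected file path for one reducer partition.
     * outputDir is the directory passed to FileOutputFormat.setOutputPath(),
     * baseOutputPath is the third argument passed to MultipleOutputs.write().
     */
    static String reducerFileName(String outputDir, String baseOutputPath, int partition) {
        return String.format(Locale.ROOT, "%s/%s-r-%05d", outputDir, baseOutputPath, partition);
    }
}
```

For example, OutputLayout.reducerFileName("/user/out", "outputpath/K1", 0) yields /user/out/outputpath/K1-r-00000, where /user/out is a made-up output directory.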

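【Discussion】:

...you have to define the MultipleOutputs in the "driver" that runs the job. Is that right?
What do you mean by defining MultipleOutputs, and what is the driver?
The driver is the file that runs the job. There you have to call MultipleOutputs.addNamedOutput(job, ..., TextOutputFormat.class,...)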