How to use CombineFileInputFormat on gzip files? Answer

【Question title】: How to use CombineFileInputFormat on gzip files?
【Posted】: 2015-09-23 01:22:25
【Question description】:

What is the best way to use CombineFileInputFormat on gzip files?

【Question comments】:

Tags: hadoop mapreduce gzip hadoop-yarn

【Solution 1】:

This article will help you build your own InputFormat on top of CombineFileInputFormat to read and process gzip files. The sections below outline what needs to be done.

Custom InputFormat:

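Building your own custom CombineFileInputFormat is almost the same as using the stock CombineFileInputFormat. The key must be our own Writable class that holds the file name and the offset, and the value is the actual file content. isSplitable must return false (we do not want to split the files). Set the maximum split size (setMaxSplitSize) to whatever you require. Based on that value, the input format decides how many splits to create, and CombineFileRecordReader creates a record reader instance for each file in a split. You have to build your own custom record reader by adding the decompression logic to it.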
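To make the effect of the maximum split size concrete, here is a minimal sketch of how whole, unsplittable files might be packed into combined splits. It is a simplified model, not Hadoop's code: the class `SplitPackingSketch`, the method `pack`, and the rule that a split is closed once its total size reaches the limit are assumptions made for illustration, and the node and rack locality grouping that Hadoop also applies is ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model only: packs whole (unsplittable) files into combined splits.
// It ignores the node/rack locality grouping that Hadoop also applies.
class SplitPackingSketch {

    // Returns groups of file lengths. A group is closed once its total size
    // reaches maxSplitSize (assumed rule for this sketch); 0 means no limit.
    static List<List<Long>> pack(long[] fileLengths, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long len : fileLengths) {
            current.add(len);
            currentSize += len;
            if (maxSplitSize > 0 && currentSize >= maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
        }
        if (!current.isEmpty()) {
            splits.add(current); // leftover files form the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] gzipFiles = {40 * mb, 30 * mb, 50 * mb, 10 * mb};
        System.out.println(pack(gzipFiles, 64 * mb).size() + " splits");
    }
}
```

Because gzip files cannot be split, a single large file still ends up whole in one split, so the configured limit acts as a target rather than a hard cap in this model.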

Custom RecordReader:

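The custom RecordReader uses a LineReader and sets the key to the file name and offset, and the value to the actual file content. If the file is compressed, it decompresses it and then reads it. Here is an excerpt.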

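// Excerpt from the custom record reader; path, dPath, fs and rlength are fields of that class.
// Needs org.apache.hadoop.io.IOUtils and org.apache.hadoop.io.compress.{CompressionCodec, CompressionCodecFactory}.
private void codecWiseDecompress(Configuration conf) throws IOException {

    // Pick the codec from the file extension (for example .gz).
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(path);

    if (codec == null) {
        // Fail this task instead of killing the JVM with System.exit.
        throw new IOException("No Codec Found For " + path);
    }

    // Target path: the original path without the codec's extension.
    String outputUri = CompressionCodecFactory.removeSuffix(path.toString(), codec.getDefaultExtension());
    dPath = new Path(outputUri);

    InputStream in = null;
    OutputStream out = null;
    fs = this.path.getFileSystem(conf);

    try {
        // Decompress the whole file into a new file next to the original.
        in = codec.createInputStream(fs.open(path));
        out = fs.create(dPath);
        IOUtils.copyBytes(in, out, conf);
    } finally {
        IOUtils.closeStream(in);
        IOUtils.closeStream(out);
    }
    // Read the length only after a successful copy, so a failed copy is not masked.
    rlength = fs.getFileStatus(dPath).getLen();
}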
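The excerpt above writes a decompressed copy of the file before reading it. For comparison, here is a minimal sketch of reading lines straight from a gzip stream, decompressing on the fly with no intermediate file. It uses only the JDK's `GZIPInputStream`; the class `GzipStreamReadSketch` and its methods are hypothetical names for illustration. In a Hadoop record reader, the stream returned by `codec.createInputStream(fs.open(path))` would presumably take the place of `GZIPInputStream` here, but that wiring is an assumption and is not shown.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipStreamReadSketch {

    // Reads all lines from a gzip-compressed stream, decompressing on the fly.
    static List<String> readLines(InputStream compressed) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(compressed), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Helper for the demo: gzip a string in memory.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = gzip("first line\nsecond line\n");
        for (String line : readLines(new ByteArrayInputStream(data))) {
            System.out.println(line);
        }
    }
}
```

The trade-off is that a streaming reader does not know the decompressed length up front, so progress reporting would have to be based on something else, such as the compressed bytes consumed.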

Custom Writable class:

A pair holding the file name and the offset value.
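A minimal sketch of such a key might look as follows. The class name `FileOffsetKey` and its fields are hypothetical. In a Hadoop job the class would additionally declare `implements WritableComparable<FileOffsetKey>`; the `write`, `readFields` and `compareTo` methods below follow the shape of that interface, but the declaration is left out here so the sketch compiles with the JDK alone.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical key class: a (file name, offset) pair.
// In Hadoop it would also implement WritableComparable<FileOffsetKey>.
class FileOffsetKey implements Comparable<FileOffsetKey> {
    String fileName = "";
    long offset;

    // No-arg constructor, needed when the framework creates instances
    // before calling readFields.
    FileOffsetKey() { }

    FileOffsetKey(String fileName, long offset) {
        this.fileName = fileName;
        this.offset = offset;
    }

    // Serializes the pair in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(fileName);
        out.writeLong(offset);
    }

    // Reads the pair back in the same order it was written.
    public void readFields(DataInput in) throws IOException {
        fileName = in.readUTF();
        offset = in.readLong();
    }

    // Orders by file name first, then by offset within the file.
    @Override
    public int compareTo(FileOffsetKey other) {
        int byName = fileName.compareTo(other.fileName);
        return byName != 0 ? byName : Long.compare(offset, other.offset);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FileOffsetKey)) return false;
        FileOffsetKey k = (FileOffsetKey) o;
        return offset == k.offset && fileName.equals(k.fileName);
    }

    @Override
    public int hashCode() {
        return 31 * fileName.hashCode() + Long.hashCode(offset);
    }
}
```

Defining `equals` and `hashCode` consistently with `compareTo` matters because keys are hashed for partitioning and compared for sorting.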

【Comments】:

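@VigneshI Thanks for the reply. I have already looked into this option; it is not ideal, because it can have side effects on the files created on HDFS. Is there a better way to do this without creating decompressed temporary files?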
@Vignesh Do you have a solution that does not need temporary files?