PIG Custom loader's getNext() is being called again and again

【Question title】: PIG Custom loader's getNext() is being called again and again
【Posted】: 2014-09-30 05:23:21
【Question description】:

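I have started using Apache Pig for one of our projects. I need to create a custom input format to load our data files. For this I followed the example Hadoop:Custom Input format. I also created a custom RecordReader implementation that reads the data (we receive it in binary format from another application) and parses it into proper JSON.

The problem appears when I use my custom loader in a Pig script. When my loader's getNext() method is called, it calls my custom RecordReader's nextKeyValue() method, which works fine. It reads the data correctly and passes it back to my loader, which parses the data and returns a Tuple. So far so good.

The trouble is that my loader's getNext() method is called again and again. Each call works fine and returns the correct output (I debugged it up to the return statement). But instead of letting execution move on, my loader is simply called again. I tried counting how many times my loader is called, and I can see the number go all the way up to 20K!

Can someone help me understand the problem in my code?

Loader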

public class SimpleTextLoaderCustomFormat extends LoadFunc {

protected RecordReader in = null;
private byte fieldDel = '\t';
private ArrayList<Object> mProtoTuple = null;
private TupleFactory mTupleFactory = TupleFactory.getInstance();

@Override
public Tuple getNext() throws IOException {
    Tuple t = null;
    try {
        boolean notDone = in.nextKeyValue();
        if (!notDone) {
            return null;
        }
        String value = (String) in.getCurrentValue();
        byte[] buf = value.getBytes();
        int len = value.length();
        int start = 0;

        for (int i = 0; i < len; i++) {
            if (buf[i] == fieldDel) {
                readField(buf, start, i);
                start = i + 1;
            }
        }
        // pick up the last field
        readField(buf, start, len);

        t =  mTupleFactory.newTupleNoCopy(mProtoTuple);
        mProtoTuple = null;

    } catch (InterruptedException e) {
        int errCode = 6018;
        String errMsg = "Error while reading input";
        e.printStackTrace();
        throw new ExecException(errMsg, errCode,
                PigException.REMOTE_ENVIRONMENT, e);
    }
    return t;
}

private void readField(byte[] buf, int start, int end) {
    if (mProtoTuple == null) {
        mProtoTuple = new ArrayList<Object>();
    }

    if (start == end) {
        // NULL value
        mProtoTuple.add(null);
    } else {
        mProtoTuple.add(new DataByteArray(buf, start, end));
    }

}

@Override
public InputFormat getInputFormat() throws IOException {
    //return new TextInputFormat();
    return new CustomStringInputFormat();
}

@Override
public void setLocation(String location, Job job) throws IOException {
    FileInputFormat.setInputPaths(job, location);
}

@Override
public void prepareToRead(RecordReader reader, PigSplit split)
        throws IOException {
    in = reader;
}

}

Custom input format

public class CustomStringInputFormat extends FileInputFormat<String, String> {

    @Override
    public RecordReader<String, String> createRecordReader(InputSplit arg0,
            TaskAttemptContext arg1) throws IOException, InterruptedException {
        return new CustomStringInputRecordReader();
    }

}

Custom RecordReader

public class CustomStringInputRecordReader extends RecordReader<String, String> {

    private String fileName = null;
    private String data = null;
    private Path file = null;
    private Configuration jc = null;
    private static int count = 0;

    @Override
    public void close() throws IOException {
//      jc = null;
//      file = null;
    }

    @Override
    public String getCurrentKey() throws IOException, InterruptedException {
        return fileName;
    }

    @Override
    public String getCurrentValue() throws IOException, InterruptedException {
        return data;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return 0;
    }

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException, InterruptedException {
        FileSplit split = (FileSplit) genericSplit;
        file = split.getPath();
        jc = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        InputStream is = FileSystem.get(jc).open(file);
        StringWriter writer = new StringWriter();
        IOUtils.copy(is, writer, "UTF-8");
        data = writer.toString();
        fileName = file.getName();
        writer.close();
        is.close();

        System.out.println("Count : " + ++count);

        return true;
    }

}

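【Question discussion】:

Hi Aakash, the code above gives me empty output. Can you help me fix it?
Hmm, that's strange. Can you try debugging the reader's nextKeyValue() method?
Stack trace, please!
@JeffWood, this is an old question that I solved somehow. I forgot to post the answer here, and I no longer remember how I solved it. Also, since no exception was thrown and execution never stopped, there was no stack trace to get. In any case, I should mark it as closed.
What was the solution?

Tags: java hadoop apache-pig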

【Solution 1】:

Try this in the loader

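//....

// Cast the raw RecordReader to the concrete reader type before calling it.
boolean notDone = ((CustomStringInputRecordReader) in).nextKeyValue();

//...

// Wrap the String value in a Hadoop Text (org.apache.hadoop.io.Text).
Text value = new Text(((CustomStringInputRecordReader) in).getCurrentValue());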
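Separately from the cast, the reader posted in the question looks like a probable cause of the repeated calls: its nextKeyValue() re-reads the whole file and returns true on every call, so the loader's getNext() never reaches its `return null` branch, and Pig keeps calling getNext() until it gets null. Below is a minimal, Hadoop-free sketch of the usual "emit the file once, then report end of input" pattern. The class name `WholeFileReaderSketch`, the `processed` flag, and the `countRecords` helper are hypothetical and only model the behaviour; in the real code the same flag would presumably live in CustomStringInputRecordReader.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hadoop-free model of a whole-file record reader; all names are hypothetical.
public class WholeFileReaderSketch {

    private final Path file;
    private String data = null;
    private boolean processed = false; // becomes true once the single record is emitted

    public WholeFileReaderSketch(Path file) {
        this.file = file;
    }

    // Mirrors RecordReader.nextKeyValue(): true only while a record is available.
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false; // end of input, so the caller's loop can stop
        }
        data = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        processed = true;
        return true;
    }

    public String getCurrentValue() {
        return data;
    }

    // Models how the loader is driven: keep asking until the reader says "no more".
    public static int countRecords(Path file) throws IOException {
        WholeFileReaderSketch reader = new WholeFileReaderSketch(file);
        int count = 0;
        while (reader.nextKeyValue()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sketch", ".txt");
        Files.write(tmp, "a\tb\tc".getBytes(StandardCharsets.UTF_8));
        System.out.println("Records read: " + countRecords(tmp));
        Files.delete(tmp);
    }
}
```

With the flag in place the loop ends after one record per file. Without it, the loop in `countRecords` would never terminate, which matches the ever-growing counter described in the question.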

【Discussion】: